Abstract

Designing short DNA words is a problem of constructing a set (i.e., code) of n DNA strings
(i.e., words) with the minimum length such that the Hamming distance between each pair of
words is at least k and the n words satisfy a set of additional constraints. This problem has
applications in, e.g., DNA self-assembly and DNA arrays. Previous works include those that extended
results from coding theory to obtain bounds on code and word sizes for biologically motivated
constraints and those that applied heuristic local searches, genetic algorithms, and randomized
algorithms. In particular, Kao, Sanghi, and Schweller [16] developed polynomial-time randomized
algorithms to construct n DNA words of length within a multiplicative constant of the
smallest possible word length (e.g., $9 \cdot \max\{\log n, k\}$) that satisfy various sets of constraints with
high probability. In this paper, we give deterministic polynomial-time algorithms to construct
DNA words based on derandomization techniques. Our algorithms can construct n DNA words
of shorter length (e.g., $2.1 \log n + 6.28k$) and satisfy the same sets of constraints as the words
constructed by the algorithms of Kao et al. Furthermore, we extend these new algorithms to
construct words that satisfy a larger set of constraints for which the algorithms of Kao et al. do
not work.

Keywords: DNA word design, deterministic algorithms, derandomization.

1 Introduction

Building on the work of Kao, Sanghi, and Schweller [16], this paper considers the problem of
designing sets (codes) of DNA strings (words) satisfying certain combinatorial constraints with the 
length as short as possible. Many applications depend on the scalable design of such words. For 
instance, DNA words can be used to store information at the molecular level [6], to act as molecular 
bar codes for identifying molecules in complex libraries [6,7,20], or to implement DNA arrays [3]. 
For DNA computing, inputs to computational problems are encoded into DNA strands to perform 
computation via complementary binding [1,25]. For DNA self-assembly, Wang tile self-assembly 
systems are implemented by encoding glues of Wang tiles into DNA strands [2, 24-26]. 
A set of DNA words chosen for such applications typically needs to meet certain combinatorial
constraints. For instance, hybridization should not occur between distinct words in the set, or even
between a word and the reverse of another word in the set. For such requirements, Marathe et al. [18]
proposed the basic Hamming constraint ($C_1$), the reverse complement Hamming constraint ($C_2$),
and the self-complementary constraint ($C_3$). In addition to $C_1$, $C_2$, and $C_3$, Kao et al. [16] further
considered certain more restrictive shifting versions ($C_4$, $C_5$, $C_6$) of these constraints, which require
$C_1$, $C_2$, and $C_3$ to hold between alignments of pairs of words [5].

Kao et al. [16] also considered three constraints unrelated to Hamming distance. The GC content
constraint ($C_7$) requires that a specified fraction of the bases in a word are G or C. This constraint
gives the words similar thermodynamic properties [21-23]. The consecutive base constraint ($C_8$)
limits the length of any run of identical bases in a word. Long runs of identical bases can cause
hybridization errors [4, 5, 21]. The free energy constraint ($C_9$) requires that the difference in the
free energies of two words is bounded by a small constant. This constraint helps ensure that the
words in the set have similar melting temperatures [5, 18].

Furthermore, it is desirable for the length $\ell$ of the words to be as small as possible. The
motivation for minimizing $\ell$ is in part that longer DNA strands are more difficult to synthesize.
Also, longer DNA strands require more DNA to be used for the respective application.

There have been a considerable number of previous works on the design of DNA words
[5, 6, 9-13, 15, 17-20, 22, 23]. Most of the existing works are based on heuristics, genetic algorithms, or
stochastic local searches and do not provide analytical performance guarantees. Notable exceptions
include the work of Marathe et al. [18] that extends results from coding theory to obtain bounds
on code size for biologically motivated constraints. Also, Kao et al. [16] formulated an optimization
problem that takes as input a desired cardinality n and produces n words of length $\ell$ that satisfy
a specified set of constraints, while minimizing the length $\ell$. Kao et al. introduced randomized
algorithms that run in polynomial time to construct words whose length $\ell$ is within a constant
multiplicative factor of the optimal word length. However, with a non-negligible probability, the
constructed words do not satisfy the given constraints. The results of Kao et al. are summarized
in Table 1 for comparison with ours.

This paper presents deterministic polynomial-time algorithms for constructing n desired words 
of length within a constant multiplicative factor of the optimal word length. As shown in Table 1, 
our algorithms can construct words shorter than those constructed by the randomized algorithms 
of Kao et al. [16]. Also, our algorithms can construct desired words that satisfy more constraints 
than the work of Kao et al. has done. Our algorithms derandomize a randomized algorithm of Kao 
et al. Depending on the values of k and n, different parameters of derandomization can be applied 
to minimize the length $\ell$ of words. Our results are summarized in Table 1.

An Erratum The conference version of this work [14] has claimed a set of results based on 
expander codes. As we announced at our conference presentation of this work, those results are 
false. Those results have been removed from this full version. 

Organization of the Remainder of This Paper Section 2 gives some basic notations and the
nine constraints $C_1$ through $C_9$ for DNA words. Section 3 discusses how to design a set of short
DNA words satisfying the constraints $C_1$ and $C_4$. Section 4 discusses how to construct short DNA
words under additional sets of constraints. Section 5 concludes the paper with some directions for
further research.
| Codes | Randomized Algorithms (Kao et al. [16]) | Deterministic Algorithms (this paper) |
|---|---|---|
| $\mathcal{W}_{1,4}$ | see $\mathcal{W}_{1\sim6}$ | $\ell^* = \lceil c_1 \log n + c_2 k \rceil$ |
| $\mathcal{W}_{1\sim6}$ | $\ell = 9\max\{\log n, k\}$ | $\ell = \ell^* + k$ |
| $\mathcal{W}_{1\sim7}$ | $\ell = 10\max\{\log n, k\}$ | $\ell = \ell^* + 2k$ |
| $\mathcal{W}_{1\sim3,7,8}$ | $\ell =$ [factor illegible] $\cdot\, 10\max\{\log n, k\}$ | $\ell =$ [factor illegible] $\cdot\, (\ell^* + 2k) + O(1)$ |
| $\mathcal{W}_{1\sim8}$ | no result | $\ell = \ell^* + 2k$ when [bounds on $\gamma$ illegible]; $\ell =$ [coefficient illegible] $\ell^* +$ [coefficient illegible] $2k + O(d)$ when $d > 3$ |
| $\mathcal{W}_{1\sim6,9}$ | $\ell =$ [constant illegible] $\max\{\log n, k\}$ when $\sigma > 4D + \Gamma_{\max}$ | $\ell = 3\ell^* + 2k$ when $\sigma >$ [term illegible] $+\, \Gamma_{\max}$ |

Table 1: Comparison of word lengths. The constraints $C_1$ through $C_9$ are defined in Section 2.
$\mathcal{W}_{1,4}$ is a code of n words that satisfies $C_1$ and $C_4$. Code $\mathcal{W}_{1\sim6}$ satisfies $C_1$ through $C_6$. Codes
$\mathcal{W}_{1\sim7}$, $\mathcal{W}_{1\sim3,7,8}$, $\mathcal{W}_{1\sim8}$, and $\mathcal{W}_{1\sim6,9}$ are similarly defined. The output parameters $\ell$ and $\ell^*$ are the
lengths of the constructed words. The constraint parameter k is the maximum of the dissimilarity
parameters for the associated subset of $C_1$ through $C_6$; the constraint parameter d is the run-length
parameter for $C_8$; the constraint parameters $\sigma$, $D$, and $\Gamma_{\max}$ are free-energy parameters for
$C_9$, where $D$ and $\Gamma_{\max}$ are defined in Section 4.5. The design parameters $c_1$ and $c_2$ can be used
to control the lengths of the constructed words, where $c_1$ is any real number greater than 2, and
$c_2 = \frac{c_1}{2}\left\{\log\left(\frac{c_1}{(c_1-2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2}\right\}$. As examples, for $c_1 = 2.1$, $\ell^* = \lceil 2.1\log n + 6.28k \rceil$, and for
$c_1 = 3$, $\ell^* = \lceil 3\log n + 4.76k \rceil$. For simplicity, we omit the ceiling notation from the right-hand sides
of expressions for $\ell$ in the table. The results of this work summarized in this table are corollaries of
Theorems 9, 13, 15, 17, 19, 21, and 23. The lengths $\ell^*$ and k used in these theorems are typically
slightly smaller than those used in this table.

Technical Remarks Throughout this paper, all logarithms log have base 2 unless explicitly 
specified otherwise. 

2 Preliminaries 

This paper considers words on two alphabets, namely, the binary alphabet $\Pi_B = \{0, 1\}$ and the
DNA alphabet $\Pi_D = \{A, C, G, T\}$.

Let $X = x_1 \cdots x_\ell$ be a word where $x_i$ belongs to an alphabet $\Pi$. The reverse of X, denoted
by $X^R$, is the word $x_\ell x_{\ell-1} \cdots x_1$. The complement of X, denoted by $X^C$, is $x_1^C \cdots x_\ell^C$, where if
$\Pi$ is the binary alphabet $\Pi_B = \{0, 1\}$, then $0^C = 1$ and $1^C = 0$, and if $\Pi$ is the DNA alphabet
$\Pi_D = \{A, C, G, T\}$, then $A^C = T$, $C^C = G$, $G^C = C$, and $T^C = A$. For integers i and j with
$1 \le i \le j \le \ell$, $X[i \cdots j]$ denotes the substring $x_i \cdots x_j$ of X. The Hamming distance between two
words X and Y of equal length, denoted by $H(X, Y)$, is the number of positions where X and Y
differ.
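The notation above maps directly onto a few string helpers. The following is a minimal sketch, assuming words are stored as Python strings; the function names are illustrative and do not come from the paper, and the reverse complement (written $X^{RC}$ in the constraints of this section) is taken here to be the complement of the reverse.

```python
# Illustrative helpers for the notation of this section (names are assumptions).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "0": "1", "1": "0"}


def reverse(x: str) -> str:
    """X^R: the characters of X in reverse order."""
    return x[::-1]


def complement(x: str) -> str:
    """X^C: every character replaced by its complement."""
    return "".join(COMPLEMENT[c] for c in x)


def reverse_complement(x: str) -> str:
    """X^RC: the complement of the reverse of X."""
    return complement(reverse(x))


def substring(x: str, i: int, j: int) -> str:
    """X[i ... j] with 1-indexed, inclusive bounds as in the text."""
    return x[i - 1:j]


def hamming(x: str, y: str) -> int:
    """H(X, Y): number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(a != b for a, b in zip(x, y))
```

Because complementing is applied position by position, the order in which reverse and complement are composed does not change the result.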

Next we review the nine constraints $C_1$ through $C_9$ as defined in [16]. Let $\mathcal{W}$ be a set of
words of equal length $\ell$. The constraints are defined for $\mathcal{W}$. For naming consistency, we rename
the Self-Complementary Constraint of [16] to the Self Reverse Complementary Constraint in this 
paper; similarly, we rename the Shifting Self-Complementary Constraint of [16] to the Shifting Self 
Reverse Complementary Constraint in this paper. 

1. Basic Hamming Constraint $C_1(k_1)$: Given an integer $k_1$ with $\ell \ge k_1 \ge 0$, for any distinct
words $Y, X \in \mathcal{W}$,

$$H(Y, X) \ge k_1. \tag{1}$$

This constraint limits non-specific hybridization between a word Y and the Watson-Crick
complement of a distinct word X (and by symmetry between the Watson-Crick complement
of a word Y and a distinct word X).

2. Reverse Complementary Constraint $C_2(k_2)$: Given an integer $k_2$ with $\ell \ge k_2 \ge 0$, for
any distinct words $Y, X \in \mathcal{W}$,

$$H(Y, X^{RC}) \ge k_2.$$

This constraint limits hybridization between a word Y and the reverse of a distinct word X.

3. Self Reverse Complementary Constraint $C_3(k_3)$: Given an integer $k_3$ with $\ell \ge k_3 \ge 0$,
for any word $Y \in \mathcal{W}$,

$$H(Y, Y^{RC}) \ge k_3.$$

This constraint prevents a word Y from hybridizing with the reverse of itself.

4. Shifting Hamming Constraint $C_4(k_4)$: Given an integer $k_4$ with $\ell \ge k_4 \ge 0$, for any
distinct words $Y, X \in \mathcal{W}$,

$$H(Y[1 \cdots i], X[(\ell - i + 1) \cdots \ell]) \ge k_4 - (\ell - i) \text{ for all } \ell \ge i \ge \ell - k_4. \tag{2}$$

This constraint is a stronger version of the constraint $C_1$ applied to every pair of a prefix of
Y and a suffix of X of equal length i with $\ell \ge i \ge \ell - k_4$, and a length-adjusted lower bound
$k_4 - (\ell - i)$ for the Hamming distance.

5. Shifting Reverse Complementary Constraint $C_5(k_5)$: Given an integer $k_5$ with
$\ell \ge k_5 \ge 0$, for any distinct words $Y, X \in \mathcal{W}$,

$$H(Y[1 \cdots i], X[1 \cdots i]^{RC}) \ge k_5 - (\ell - i); \text{ and}$$
$$H(Y[(\ell - i + 1) \cdots \ell], X[(\ell - i + 1) \cdots \ell]^{RC}) \ge k_5 - (\ell - i) \text{ for all } \ell \ge i \ge \ell - k_5.$$

This constraint is a stronger version of the constraint $C_2$ applied to every pair of a prefix of
Y and a prefix of X of equal length i and also every pair of a suffix of Y and a suffix of X
of equal length i with $\ell \ge i \ge \ell - k_5$ and a length-adjusted lower bound $k_5 - (\ell - i)$ for the
Hamming distance.

6. Shifting Self Reverse Complementary Constraint $C_6(k_6)$: Given an integer $k_6$ with
$\ell \ge k_6 \ge 0$, for any word $Y \in \mathcal{W}$,

$$H(Y[1 \cdots i], Y[1 \cdots i]^{RC}) \ge k_6 - (\ell - i); \text{ and}$$
$$H(Y[(\ell - i + 1) \cdots \ell], Y[(\ell - i + 1) \cdots \ell]^{RC}) \ge k_6 - (\ell - i) \text{ for all } \ell \ge i \ge \ell - k_6.$$

This constraint is a stronger version of the constraint $C_3$ applied to every prefix of Y and
every suffix of Y of length i with $\ell \ge i \ge \ell - k_6$ and a length-adjusted lower bound $k_6 - (\ell - i)$
for the Hamming distance.
7. GC Content Constraint $C_7(\gamma)$: Given a real number $\gamma$ with $1 \ge \gamma \ge 0$, $\gamma$ fraction of
the characters (e.g., $\lfloor \gamma\ell \rfloor$ characters, $\lceil \gamma\ell \rceil$ characters, or $\gamma\ell + O(1)$ characters) in each word
$Y \in \mathcal{W}$ are G or C.

The GC content affects thermodynamic properties of a word [21, 23]. Therefore, having the
same ratio of GC content for all the words helps ensure similar thermodynamic characteristics.

8. Consecutive Base Constraint $C_8(d)$: Given an integer $d \ge 2$, no word in $\mathcal{W}$ has more
than d consecutive occurrences of the same base.

In some applications, consecutive occurrences (also known as runs) of the same base increase
annealing errors.

Note that if d = 1 and $\mathcal{W}$ is a set of binary words, then $\mathcal{W}$ consists of at most two words,
of which one word starts with 0 and alternates between 0 and 1, and the other word is the
complement of the former word. The requirement that $d \ge 2$ rules out this trivial case.

9. Free Energy Constraint $C_9(\sigma)$: Given a real number $\sigma \ge 0$, for any two distinct words
$Y, X \in \mathcal{W}$,

$$|\mathrm{FE}(Y) - \mathrm{FE}(X)| \le \sigma,$$

where FE(Z) denotes the free energy of a word Z. See Section 4.5 for the definition of a
particular free energy function FE considered in [16] and this paper.

This constraint helps ensure that the words in the set $\mathcal{W}$ have similar melting temperatures,
which allows multiple DNA strands to hybridize simultaneously at a temperature [20].
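Several of the constraints above can be checked mechanically for a candidate code. The sketch below is illustrative only: it assumes a code is a list of equal-length DNA strings, it covers $C_1$ through $C_4$ and $C_8$ plus a GC counter for $C_7$, and all function names are assumptions rather than names used in [16] or in this paper.

```python
from itertools import permutations

_COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def hamming(x, y):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def rc(x):
    """Reverse complement of a DNA word."""
    return "".join(_COMP[c] for c in reversed(x))


def satisfies_c1(words, k1):
    # Inequality (1) for every ordered pair of distinct words.
    return all(hamming(y, x) >= k1 for y, x in permutations(words, 2))


def satisfies_c2(words, k2):
    return all(hamming(y, rc(x)) >= k2 for y, x in permutations(words, 2))


def satisfies_c3(words, k3):
    return all(hamming(y, rc(y)) >= k3 for y in words)


def satisfies_c4(words, k4):
    # Inequality (2): prefix of Y against suffix of X for ell >= i >= ell - k4.
    ell = len(words[0])
    return all(
        hamming(y[:i], x[ell - i:]) >= k4 - (ell - i)
        for y, x in permutations(words, 2)
        for i in range(ell - k4, ell + 1)
    )


def gc_count(word):
    """Number of G or C characters, the quantity constrained by C7."""
    return sum(c in "GC" for c in word)


def longest_run(word):
    """Length of the longest run of identical consecutive characters."""
    best = run = 0
    prev = None
    for c in word:
        run = run + 1 if c == prev else 1
        prev = c
        best = max(best, run)
    return best


def satisfies_c8(words, d):
    return all(longest_run(w) <= d for w in words)
```

For instance, the two-word code `["ACCA", "CAAC"]` passes `satisfies_c4(words, 2)` but not `satisfies_c4(words, 3)`, because the length-3 prefix of one word and the length-3 suffix of the other differ in only one position.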

The lemma below summarizes some simple properties of constraints $C_1(k_1)$ through $C_6(k_6)$ and
$C_8(d)$.

Lemma 1 (see, e.g., [16]).

1. If $C_4(k)$ holds, then $C_1(k)$ also holds.

2. For each $C_p$ of the first six constraints, if $k \ge k_p$ and $C_p(k)$ holds, then $C_p(k_p)$ also holds.

3. For two integers $d \ge d' \ge 2$, if $C_8(d')$ holds, then $C_8(d)$ also holds.

4. For each $C_p$ of the first six constraints, if $\mathcal{W}$ is a set of n distinct binary words (respectively,
DNA words) of equal length $\ell$ and satisfies $C_p(k_p)$, then $\ell \ge \max\{\log n, k_p\}$ (respectively,
$\ell \ge \max\{\log_4 n, k_p\}$).

Proof. Statement 1 follows from the fact that $C_1(k)$ is the same as the case $i = \ell$ in Inequality (2)
for $C_4(k)$. Statements 2 through 4 are also straightforward. □



Technical Remarks In this work, we interpret the terms $X[1 \cdots i]^{RC}$, $X[(\ell - i + 1) \cdots \ell]^{RC}$,
$Y[1 \cdots i]^{RC}$, and $Y[(\ell - i + 1) \cdots \ell]^{RC}$ in the definitions of $C_5(k_5)$ and $C_6(k_6)$ as $(X[1 \cdots i])^{RC}$,
$(X[(\ell - i + 1) \cdots \ell])^{RC}$, $(Y[1 \cdots i])^{RC}$, and $(Y[(\ell - i + 1) \cdots \ell])^{RC}$, respectively. However, it would also
be reasonable to interpret these terms in a subtly different manner as $(X^{RC})[1 \cdots i]$,
$(X^{RC})[(\ell - i + 1) \cdots \ell]$, $(Y^{RC})[1 \cdots i]$, and $(Y^{RC})[(\ell - i + 1) \cdots \ell]$.
3 Designing Words for Constraints $C_1(k_1)$ and $C_4(k_4)$

In this section, we give a deterministic polynomial-time algorithm, namely, DetWords
(Algorithm 1), which can be used to construct a code $\mathcal{W}_{1,4}$ of n DNA words of length $\ell^* = \lceil c_1 \log n + c_2 k \rceil$
for a range of positive constants $c_1$ and $c_2$ to satisfy constraints $C_1(k_1)$ and $C_4(k_4)$, where
$k = \max\{k_1, k_4\}$.

Algorithm DetWords takes n, $\ell$, $k_1$, and $k_4$ as input and then outputs an $n \times \ell$ binary matrix.
We can view the rows of this binary matrix as a code of n binary words of length $\ell$. In turn, we can
convert these binary words into DNA words by replacing 0 and 1 with two distinct DNA characters.
The remainder of this section will focus on constructing binary words. Also, for convenience, we
will refer to binary row vectors, binary words, and DNA words interchangeably when there is no
risk of ambiguity.

We design Algorithm DetWords by derandomizing a randomized algorithm in [16]. The basic
idea for Algorithm DetWords is to implicitly generate a random $n \times \ell$ binary matrix M by assigning
0 or 1 with equal probability 1/2 to each of the $n\ell$ positions in M independently. We then derandomize
the assignment at each position to choose 0 or 1 one position at a time based on conditional
expectations of the number of pairs of distinct rows and their shifted prefixes and suffixes that
satisfy $C_1(k_1)$ and $C_4(k_4)$.

More specifically, Algorithm DetWords works as follows. It first creates an empty $n \times \ell$ binary
matrix. It then fills the empty entries one at a time with 0 or 1. Before the algorithm chooses 0
or 1 to fill an empty entry, it computes two expectations. The first expectation is the term $E_0$ at
Line 7 in Algorithm 1. Informally, this expectation is the expected number of times Inequalities
(1) and (2) are satisfied if the current empty entry is filled with 0. The second expectation is the
term $E_1$ at Line 8 in Algorithm 1. Informally, this expectation is the expected number of times
Inequalities (1) and (2) are satisfied if the current empty entry is filled with 1. These expectations
are formally defined in Equation (3) below. According to the manner in which Equation (3) counts
how many times Inequalities (1) and (2) are satisfied, a set of n words of length $\ell$ can satisfy or
fail these inequalities exactly $\binom{n}{2} \cdot (1 + 2(k_4 - 1))$ times in total. In particular, a set of n words
of length $\ell$ satisfies Constraints $C_1(k_1)$ and $C_4(k_4)$ if and only if it satisfies Inequalities (1) and
(2) exactly $\binom{n}{2} \cdot (1 + 2(k_4 - 1))$ times and fails 0 times. Furthermore, when $\ell$ is sufficiently large,
an empty $n \times \ell$ binary matrix is expected to satisfy Inequalities (1) and (2) strictly more than
$\binom{n}{2} \cdot (1 + 2(k_4 - 1)) - 1$ times. With this lower bound and the linearity of expectations, Algorithm
DetWords can choose to fill each empty entry with 0 or 1 one at a time to arrive at a set of n
words of length $\ell$ which satisfies Inequalities (1) and (2) exactly $\binom{n}{2} \cdot (1 + 2(k_4 - 1))$ times and thus
satisfies Constraints $C_1(k_1)$ and $C_4(k_4)$. That is, Algorithm DetWords chooses to fill an empty
entry with 0 or 1, whichever yields a larger expected number of times Inequalities (1) and (2) are
satisfied.

To choose a sufficiently large $\ell$ for Algorithm DetWords, let $\delta$ be any positive real number.
Let $c_1 = 2 + \delta$. Let $c_2 = \frac{c_1}{2}\left\{\log\left(\frac{c_1}{(c_1-2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2}\right\}$. Let $k = \max\{k_1, k_4\}$. Let
$\ell^* = \lceil c_1 \log n + c_2 k \rceil$. Theorem 9 below shows that, by setting $\ell = \ell^*$, Algorithm DetWords
deterministically constructs a code $\mathcal{W}_{1,4}$ of n DNA words of length $\ell^*$ that satisfies constraints $C_1(k_1)$
and $C_4(k_4)$. Theorem 9 also shows that this construction takes $O(n^2(\ell^*)^3)$ time.
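The constants in the word length are easy to evaluate numerically. The following is a small sketch, assuming the formula for $c_2$ stated in the caption of Table 1 and base-2 logarithms as declared in the Technical Remarks; the function names `c2_constant` and `word_length` are illustrative and not part of the paper.

```python
import math


def c2_constant(c1: float) -> float:
    """c2 = (c1/2) * { log2( c1 / ((c1 - 2) ln 2) ) + 2.5 - 1/ln 2 }, for c1 > 2."""
    if c1 <= 2:
        raise ValueError("c1 must be greater than 2")
    ln2 = math.log(2)
    return (c1 / 2) * (math.log2(c1 / ((c1 - 2) * ln2)) + 2.5 - 1 / ln2)


def word_length(n: int, k1: int, k4: int, delta: float = 1.0) -> int:
    """ell* = ceil(c1 * log2(n) + c2 * k) with c1 = 2 + delta and k = max(k1, k4)."""
    c1 = 2 + delta
    k = max(k1, k4)
    return math.ceil(c1 * math.log2(n) + c2_constant(c1) * k)
```

Evaluating `c2_constant(3.0)` and `c2_constant(2.1)` reproduces, after rounding to two decimals, the coefficients 4.76 and 6.28 quoted for $\delta = 1$ and $\delta = 0.1$. A smaller $\delta$ lowers the coefficient of $\log n$ at the price of a larger coefficient of k, which is why the best choice depends on the relative sizes of n and k.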

The remainder of this section provides details to elaborate on the above overview. In Section 3.1,
we define a polynomial-time computable expectation that will be used by Algorithm DetWords for
the purpose of derandomization. In Section 3.2, we give Algorithm DetWords in Algorithm 1. The
word length $\ell^*$ above is determined analytically and for the binary alphabet; in Section 3.3, we
discuss how to improve this word length computationally and with a larger alphabet, i.e., the DNA
alphabet.



3.1 A Polynomial-Time Computable Expectation for Derandomization 

To describe Algorithm DetWords in Algorithm 1, we first give some definitions and lemmas.

Definition 1. Given n, $\ell$, $k_1$, and $k_4$, an $n \times \ell$ binary matrix M is called a $(k_1, k_4)$-distance matrix
if the set of the n rows of M satisfies constraints $C_1(k_1)$ and $C_4(k_4)$.

Lemma 2. A $(k_1, k_4)$-distance matrix M of dimension $n \times \ell$ can be converted into a code $\mathcal{W}_{1,4}$
of n DNA words of length $\ell$ that satisfies $C_1(k_1)$ and $C_4(k_4)$.

Proof. As discussed in the overview at the start of this section, we first view the rows of M as
a code of n binary words of length $\ell$. Then, we convert these binary words into DNA words by
replacing 0 and 1 with two distinct DNA characters. □
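The conversion in Lemma 2 is a character-by-character substitution, so it preserves every Hamming distance between rows, prefixes, and suffixes. A minimal sketch follows; the function name and the default choice of A for 0 and C for 1 are assumptions made for illustration, since the lemma only requires two distinct DNA characters.

```python
def binary_rows_to_dna(matrix, zero="A", one="C"):
    """Map each row of a 0/1 matrix to a DNA word using two distinct characters."""
    if zero == one:
        raise ValueError("the two DNA characters must be distinct")
    table = {0: zero, 1: one}
    return ["".join(table[bit] for bit in row) for row in matrix]
```

Which two characters are chosen does not affect $C_1$ or $C_4$, but it does affect other properties such as GC content.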

Definition 2. Let M be an $n \times \ell$ matrix, where each $(p, q)$-th entry is 0, 1, or a distinct unknown
$x_{p,q}$. Such a matrix is called a partially assigned matrix.

Now consider a partially assigned matrix M of dimension $n \times \ell$ as a random variable where each
unknown $x_{p,q}$ can assume the value of 0 or 1 with equal probability 1/2. Next consider the expected
number of ordered pairs of distinct rows $r_\alpha$ and $r_\beta$ in M that satisfy constraints $C_1(k_1)$ and $C_4(k_4)$
where $Y = r_\alpha$ and $X = r_\beta$. As a first attempt [14], we wished to use this expectation in
Algorithm DetWords for the purpose of derandomization. However, it is not clear how to compute
this expectation in polynomial time. Therefore, in Algorithm DetWords, we will use a different
expectation $\mathrm{ExpCount}(M, k_1, k_4)$ that also works for derandomization but can be computed in
polynomial time. The expectation $\mathrm{ExpCount}(M, k_1, k_4)$ is developed as follows.

• $E_1(M, \alpha, \beta, k_1)$ denotes the event that $r_\alpha$ and $r_\beta$ satisfy Inequality (1) for $C_1(k_1)$ with $Y = r_\alpha$
and $X = r_\beta$.

• $E_4(M, \alpha, \beta, k_4, i)$ denotes the event that $r_\alpha$ and $r_\beta$ satisfy case i of Inequality (2) for $C_4(k_4)$
with $Y = r_\alpha$ and $X = r_\beta$.

• $\mathrm{ExpE}_1(M, k_1)$ denotes the expected number of unordered pairs of distinct $\alpha$ and $\beta$ for which
$E_1(M, \alpha, \beta, k_1)$ holds.

• $\mathrm{ExpE}_4(M, k_4, i)$ denotes the expected number of ordered pairs of distinct $\alpha$ and $\beta$ for which
$E_4(M, \alpha, \beta, k_4, i)$ holds.

Note that for $\mathrm{ExpE}_1(M, k_1)$, we count unordered pairs of $\alpha$ and $\beta$ but for $\mathrm{ExpE}_4(M, k_4, i)$, we
count ordered pairs. This difference is due to the following reasons. Y and X are symmetric in
Inequality (1); therefore, $\alpha$ and $\beta$ are symmetric for $E_1$. In contrast, Y and X are symmetric in
Inequality (2) only for $i = \ell$ but asymmetric for all other i; therefore $\alpha$ and $\beta$ are symmetric for
$E_4$ only for $i = \ell$ but asymmetric for all other i.

Now, let

$$\mathrm{ExpCount}(M, k_1, k_4) = \mathrm{ExpE}_1(M, \max\{k_1, k_4\}) + \sum_{i=\ell-k_4+1}^{\ell-1} \mathrm{ExpE}_4(M, k_4, i). \tag{3}$$

Note that in the right-hand side of Equality (3), the second argument of $\mathrm{ExpE}_1$ is $\max\{k_1, k_4\}$ rather
than $k_1$ as used in the definition of constraint $C_1(k_1)$. Also, the upper limit of the summation is
$\ell - 1$ rather than $\ell$ and the lower limit is $\ell - k_4 + 1$ rather than $\ell - k_4$ as used in the definition of
constraint $C_4(k_4)$. We will justify these details in Lemma 3 and its proof below.

We next develop two expressions for $\mathrm{ExpCount}(M, k_1, k_4)$ as alternatives to Equality (3) in
order to analyze and efficiently compute $\mathrm{ExpCount}(M, k_1, k_4)$.

For an event E of a probability space, let $\overline{E}$ denote the complement of E, and let $\Pr(E)$ denote
the probability of E. For a real-valued random variable V, let $\mathrm{Exp}(V)$ denote the expectation of
V.

Equalities (4) and (5) below in conjunction with Equality (3) give one of two alternative
expressions for $\mathrm{ExpCount}(M, k_1, k_4)$.

$$\mathrm{ExpE}_1(M, \max\{k_1, k_4\}) = \sum_{1 \le \alpha < \beta \le n} \left\{1 - \Pr\left(\overline{E_1}(M, \alpha, \beta, \max\{k_1, k_4\})\right)\right\}; \tag{4}$$

$$\mathrm{ExpE}_4(M, k_4, i) = \sum_{1 \le \alpha < \beta \le n} \left(\left\{1 - \Pr\left(\overline{E_4}(M, \alpha, \beta, k_4, i)\right)\right\} + \left\{1 - \Pr\left(\overline{E_4}(M, \beta, \alpha, k_4, i)\right)\right\}\right). \tag{5}$$

For $k_1$, $k_4$, $k = \max\{k_1, k_4\}$, and a binary matrix $M'$ of dimension $n \times \ell$, consider the following
two functions:

• $V_1(M', k)$ denotes the number of unordered pairs of distinct $\alpha$ and $\beta$ such that rows $r'_\alpha$ and
$r'_\beta$ of $M'$ satisfy Inequality (1) for $C_1(k)$ with $Y = r'_\alpha$ and $X = r'_\beta$.

• $V_4(M', k_4)$ denotes the number of triplets $(\alpha, \beta, i)$ such that distinct rows $r'_\alpha$ and $r'_\beta$ in $M'$
satisfy case i of Inequality (2) for $C_4(k_4)$ with $Y = r'_\alpha$ and $X = r'_\beta$, where $n \ge \alpha \ne \beta \ge 1$ and
$\ell - 1 \ge i \ge \ell - k_4 + 1$.

Note that $V_1$ is an integer function and $\binom{n}{2} \ge V_1(M', k) \ge 0$. Similarly, $V_4$ is an integer function
and $n(n-1) \cdot (k_4 - 1) \ge V_4(M', k_4) \ge 0$. Consequently, $V_1(M', k) + V_4(M', k_4)$ is an integer and
$\binom{n}{2} \cdot (1 + 2(k_4 - 1)) \ge V_1(M', k) + V_4(M', k_4) \ge 0$.

Next we combine the random variable M and the functions $V_1$ and $V_4$ to form two random
variables $V_1(M, k)$ and $V_4(M, k_4)$. Then, the following equalities give the other alternative expression
for $\mathrm{ExpCount}(M, k_1, k_4)$.

$$\mathrm{ExpE}_1(M, \max\{k_1, k_4\}) = \mathrm{Exp}(V_1(M, k)); \tag{6}$$

$$\sum_{i=\ell-k_4+1}^{\ell-1} \mathrm{ExpE}_4(M, k_4, i) = \mathrm{Exp}(V_4(M, k_4)); \tag{7}$$

$$\mathrm{ExpCount}(M, k_1, k_4) = \mathrm{Exp}(V_1(M, k)) + \mathrm{Exp}(V_4(M, k_4)). \tag{8}$$

Lemmas 3 through 5 below analyze $\mathrm{ExpCount}(M, k_1, k_4)$.
Lemma 3. Let M be a partially assigned matrix of dimension $n \times \ell$. If

$$\mathrm{ExpCount}(M, k_1, k_4) > \binom{n}{2} \cdot (1 + 2(k_4 - 1)) - 1, \tag{9}$$

then there exists an assignment of 0's and 1's to the unknowns in M so that the resulting binary
matrix $M'$ is a $(k_1, k_4)$-distance matrix.

Proof. Recall that for every binary matrix $M''$ generated from M, $V_1(M'', k) + V_4(M'', k_4)$ is an
integer and $\binom{n}{2} \cdot (1 + 2(k_4 - 1)) \ge V_1(M'', k) + V_4(M'', k_4)$. Therefore, Inequalities (9) and (8)
imply that there exists a binary matrix $M'$ generated from M such that $V_1(M', k) + V_4(M', k_4) =
\binom{n}{2} \cdot (1 + 2(k_4 - 1))$. Then, since $\binom{n}{2} \ge V_1(M', k)$ and $n(n-1) \cdot (k_4 - 1) \ge V_4(M', k_4)$, we have
$V_1(M', k) = \binom{n}{2}$ and $V_4(M', k_4) = n(n-1) \cdot (k_4 - 1)$.

Next, since $V_1(M', k) = \binom{n}{2}$ and there are $\binom{n}{2}$ unordered pairs of distinct rows in $M'$, the n
rows of the binary matrix $M'$ satisfy $C_1(k)$. Since $k = \max\{k_1, k_4\}$, by Lemma 1(2), the n rows of
$M'$ satisfy $C_1(k_1)$.

Likewise, the n rows of $M'$ satisfy $C_1(k_4)$. Now observe that Inequality (1) for $C_1(k_4)$ is the
same as case $i = \ell$ in Inequality (2) for $C_4(k_4)$. Therefore, the n rows of $M'$ satisfy case $i = \ell$
in Inequality (2) for $C_4(k_4)$ as well. Next, since $V_4(M', k_4) = n(n-1) \cdot (k_4 - 1)$ and there are
$n(n-1) \cdot (k_4 - 1)$ triplets $(\alpha, \beta, i)$ with $n \ge \alpha \ne \beta \ge 1$ and $\ell - 1 \ge i \ge \ell - k_4 + 1$, the n rows of
$M'$ satisfy Inequality (2) of $C_4(k_4)$ for $\ell - 1 \ge i \ge \ell - k_4 + 1$. Furthermore, since case $i = \ell - k_4$
in Inequality (2) for $C_4(k_4)$ always holds, the n rows of $M'$ satisfy the entire $C_4(k_4)$ constraint as
well.

In sum, the n rows of $M'$ satisfy both constraints $C_1(k_1)$ and $C_4(k_4)$. This finishes the proof. □

Lemma 4. Let M be a partially assigned matrix of dimension $n \times \ell$. Assume that the $(p, q)$-th
entry of M is an unknown. Let $M_0$ (respectively, $M_1$) be M with the $(p, q)$-th entry assigned 0
(respectively, 1). Then

$$\mathrm{ExpCount}(M, k_1, k_4) = \frac{1}{2} \cdot \mathrm{ExpCount}(M_0, k_1, k_4) + \frac{1}{2} \cdot \mathrm{ExpCount}(M_1, k_1, k_4).$$

Proof. This lemma follows from Equality (8), the linearity of expectations $\mathrm{Exp}(V_1(M, k))$ and
$\mathrm{Exp}(V_4(M, k_4))$, and the fact that M is considered a random variable where each of the unknown
entries is independently assigned 0 or 1 with equal probability 1/2. □

Lemma 5. Let $k = \max\{k_1, k_4\}$. Given $r_\alpha$, $r_\beta$, $k_1$, $k_4$, and i as the input, each of the probabilities
in the right-hand sides of Equalities (4) and (5) can be computed in $O(\ell + k)$ time.

Proof. The specified probabilities can be computed in essentially the same manner. Here, we only
show how to compute $\Pr\left(\overline{E_1}(M, \alpha, \beta, \max\{k_1, k_4\})\right)$ in the desired time complexity. Let s be the
number of positions at which $r_\alpha$ and $r_\beta$ both assume values of 0 or 1 and are not unknowns. Let t be
the number of these s positions where $r_\alpha$ and $r_\beta$ assume different binary values. Then,

$$\Pr\left(\overline{E_1}(M, \alpha, \beta, \max\{k_1, k_4\})\right) = \sum_{j=0}^{k-t-1} \binom{\ell - s}{j} 2^{-(\ell - s)}. \tag{10}$$

It is elementary to first determine s and t and then compute the right-hand side of Equality (10) in
$O(\ell + \log(\ell - s) + k - t)$ total time, which is $O(\ell + k)$ time. □
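The probability in the proof of Lemma 5 is a binomial tail: positions where both rows are already assigned contribute a fixed number t of differences, and each of the remaining $\ell - s$ positions differs independently with probability 1/2. The sketch below computes that tail and compares it with exhaustive enumeration; it assumes rows are Python lists with `None` marking an unknown, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb


def fail_probability(row_a, row_b, k):
    """Pr[H(row_a, row_b) < k] when each None entry is an independent fair bit."""
    ell = len(row_a)
    known = [(a, b) for a, b in zip(row_a, row_b)
             if a is not None and b is not None]
    s = len(known)                      # positions fixed in both rows
    t = sum(a != b for a, b in known)   # differences already forced
    need = max(k - t, 0)                # failing means fewer than `need` new differences
    return Fraction(sum(comb(ell - s, j) for j in range(need)), 2 ** (ell - s))


def fail_probability_brute_force(row_a, row_b, k):
    """Same probability, by enumerating every assignment of the unknowns."""
    slots = [(r, idx) for r, row in enumerate((row_a, row_b))
             for idx, v in enumerate(row) if v is None]
    bad = 0
    for bits in product((0, 1), repeat=len(slots)):
        rows = [list(row_a), list(row_b)]
        for (r, idx), bit in zip(slots, bits):
            rows[r][idx] = bit
        if sum(a != b for a, b in zip(rows[0], rows[1])) < k:
            bad += 1
    return Fraction(bad, 2 ** len(slots))
```

Exact fractions are used instead of floats so that later comparisons between two expectations are not affected by rounding; that is a design choice of this sketch, not something prescribed by the lemma.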

3.2 Algorithm DetWords for Designing Words for $C_1(k_1)$ and $C_4(k_4)$

With $\mathrm{ExpCount}(M, k_1, k_4)$ defined and analyzed in Section 3.1, we describe Algorithm DetWords
in Algorithm 1.

We analyze the correctness and computational complexity of Algorithm DetWords (Algorithm 1)
with several lemmas and a theorem below. Lemmas 6 and 7 first analyze the existence of
$(k_1, k_4)$-distance matrices.
Algorithm 1 DetWords(n, $\ell$, $k_1$, $k_4$)

1: Input: integers n, $\ell$, $k_1$, and $k_4$.
2: Output: a $(k_1, k_4)$-distance matrix M of dimension $n \times \ell$.
3: Steps:
4: Construct a partially assigned matrix M of dimension $n \times \ell$ where every entry is an unknown.
5: for p = 1 to $\ell$ do
6:   for q = 1 to n do
7:     Compute $E_0 = \mathrm{ExpCount}(M_0, k_1, k_4)$, where $M_0$ is M with the unknown at the $(p, q)$-th
       entry set to 0.
8:     Compute $E_1 = \mathrm{ExpCount}(M_1, k_1, k_4)$, where $M_1$ is M with the unknown at the $(p, q)$-th
       entry set to 1.
9:     if $E_0 > E_1$ then
10:      Update M by setting the unknown at the $(p, q)$-th entry to 0.
11:    else
12:      Update M by setting the unknown at the $(p, q)$-th entry to 1.
13:    end if
14:  end for
15: end for
16: return M, which is now a binary matrix.
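Algorithm 1 can be prototyped directly from Equalities (3) through (5). The following is a deliberately naive sketch and not the authors' implementation: it recomputes the whole expectation for every entry instead of updating it incrementally, it tracks the expected number of failed inequalities (the total count minus ExpCount, so a smaller value means a larger ExpCount), and it assumes that p indexes columns and q indexes rows. All function names are illustrative.

```python
from fractions import Fraction
from math import comb


def fail_prob(y, x, k):
    """Pr[H(y, x) < k]; None entries are independent fair bits."""
    unknown = sum(1 for a, b in zip(y, x) if a is None or b is None)
    t = sum(1 for a, b in zip(y, x)
            if a is not None and b is not None and a != b)
    need = k - t
    if need <= 0:
        return Fraction(0)
    return Fraction(sum(comb(unknown, j) for j in range(need)), 2 ** unknown)


def expected_failures(M, k1, k4):
    """Total count of inequality instances minus ExpCount(M, k1, k4)."""
    n, ell = len(M), len(M[0])
    k = max(k1, k4)
    total = Fraction(0)
    for a in range(n):
        for b in range(a + 1, n):
            total += fail_prob(M[a], M[b], k)            # E_1 with max{k1, k4}
            for i in range(ell - k4 + 1, ell):           # cases ell-k4+1 .. ell-1
                bound = k4 - (ell - i)
                total += fail_prob(M[a][:i], M[b][ell - i:], bound)
                total += fail_prob(M[b][:i], M[a][ell - i:], bound)
    return total


def det_words(n, ell, k1, k4):
    """Fill an n x ell matrix entry by entry, keeping the better expectation."""
    M = [[None] * ell for _ in range(n)]
    for p in range(ell):          # column index
        for q in range(n):        # row index
            M[q][p] = 0
            f0 = expected_failures(M, k1, k4)
            M[q][p] = 1
            f1 = expected_failures(M, k1, k4)
            M[q][p] = 0 if f0 < f1 else 1   # E0 > E1 is the same as f0 < f1
    return M


def satisfies_c1_c4(M, k1, k4):
    """Direct check of Inequalities (1) and (2) on a fully assigned matrix."""
    n, ell = len(M), len(M[0])
    ham = lambda y, x: sum(a != b for a, b in zip(y, x))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            if ham(M[a], M[b]) < k1:
                return False
            for i in range(ell - k4, ell + 1):
                if ham(M[a][:i], M[b][ell - i:]) < k4 - (ell - i):
                    return False
    return True
```

For example, `det_words(4, 8, 2, 2)` starts from an expected failure count below 1, so by the averaging property of Lemma 4 the count never increases and the returned matrix passes `satisfies_c1_c4`. The naive recomputation is far slower than the incremental update analyzed for Theorem 9 and is only meant for small instances.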



Lemma 6. Given n, $k_1$, $k_4$, and $k = \max\{k_1, k_4\}$, if $\ell$ satisfies the following two inequalities

$$2k < \ell \tag{11}$$

$$0 \le \ell - k \log e - k \log \frac{\ell}{k} - 2 \log n - 2 \log k, \tag{12}$$

then $\ell$ satisfies Inequality (9) in Lemma 3 and thus there exists a $(k_1, k_4)$-distance matrix of
dimension $n \times \ell$.

Proof. Throughout this proof, we assume $\ell > 2k$. Consider a partially assigned matrix M of
dimension $n \times \ell$ where every entry is an unknown. To prove this lemma by means of Equalities (3),
(4), and (5), we will solve for $\ell$ the following equivalent inequality of Inequality (9):

$$\binom{n}{2}(1 + 2(k_4 - 1)) - 1 < \sum_{1 \le \alpha < \beta \le n} \left\{1 - \Pr\left(\overline{E_1}(M, \alpha, \beta, \max\{k_1, k_4\})\right)\right\}
+ \sum_{i=\ell-k_4+1}^{\ell-1} \sum_{1 \le \alpha < \beta \le n} \left(\left\{1 - \Pr\left(\overline{E_4}(M, \alpha, \beta, k_4, i)\right)\right\} + \left\{1 - \Pr\left(\overline{E_4}(M, \beta, \alpha, k_4, i)\right)\right\}\right). \tag{13}$$

Simplifying the above inequality, we have the following equivalent inequality:

$$1 > \sum_{1 \le \alpha < \beta \le n} \Pr\left(\overline{E_1}(M, \alpha, \beta, \max\{k_1, k_4\})\right)
+ \sum_{i=\ell-k_4+1}^{\ell-1} \sum_{1 \le \alpha < \beta \le n} \left(\Pr\left(\overline{E_4}(M, \alpha, \beta, k_4, i)\right) + \Pr\left(\overline{E_4}(M, \beta, \alpha, k_4, i)\right)\right).$$

Working out the probabilities in the above inequality, we have the following equivalent inequality:

$$1 > \binom{n}{2} \sum_{j=0}^{k-1} \binom{\ell}{j} 2^{-\ell} + \sum_{i=\ell-k_4+1}^{\ell-1} \binom{n}{2} \left\{ 2 \sum_{j=0}^{k_4-(\ell-i)-1} \binom{i}{j} 2^{-i} \right\}.$$

Simplifying the above inequality, we have the following equivalent inequality:

[Inequality (14); display not recoverable]

Next, replacing $k_4$ by k in Inequality (14) and moving the terms on the right-hand side to the
left-hand side, we obtain the following non-equivalent inequality:

[Inequality (15); display not recoverable]

Note that if $\ell$ satisfies Inequality (15), then $\ell$ satisfies Inequality (14) and thus Inequalities (13)
and (9). Therefore, we will now solve Inequality (15) for $\ell$ as follows.

We will find a lower bound of the left-hand side of Inequality (15). For this purpose, we first
bound the term in the rightmost summation of Inequality (15). Since $\ell > 2k$ and $\ell - 1 \ge i$, we
have $i \ge 2 \cdot (k - (\ell - i))$ and thus

[Inequality (16); display not recoverable]

Furthermore, for all integers s with $\ell - i - 1 \ge s \ge 0$, since

$$\frac{i + (s + 1)}{k - (\ell - i) + (s + 1)} \ge 2,$$

we have

$$\binom{i + (s + 1)}{k - (\ell - i) + (s + 1)} \frac{1}{2} = \binom{i + s}{k - (\ell - i) + s} \cdot \frac{i + (s + 1)}{k - (\ell - i) + (s + 1)} \cdot \frac{1}{2} \ge \binom{i + s}{k - (\ell - i) + s}. \tag{17}$$

By applying Inequality (16) once and applying Inequality (17) iteratively $\ell - i$ times, we have

[Inequality (18); display not recoverable]

This finishes the bounding of the term in the rightmost summation of Inequality (15).

We next bound the term in the leftmost summation of Inequality (15). Since $\ell > 2k$, we have

[Inequality (19); display not recoverable]

Plugging Inequalities (19) and (18) into the left-hand side of Inequality (15), we have

$$[\text{left-hand side of Inequality (15); intermediate steps not recoverable}] > 1 - n^2 k^2 \left(\frac{e\ell}{k}\right)^k 2^{-\ell} \quad \left(\text{since } \binom{\ell}{k} < \left(\frac{e\ell}{k}\right)^k\right). \tag{20}$$

Now consider the following inequality:

$$1 - n^2 k^2 \cdot \left(\frac{e\ell}{k}\right)^k 2^{-\ell} \ge 0. \tag{21}$$

Note that by Inequality (20), if $\ell$ satisfies Inequality (21), then $\ell$ satisfies Inequalities (15), (14),
(13), and (9). Consequently, the lemma follows from the fact that Inequality (21) is equivalent to
Inequality (12). □
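For an all-unknown matrix, Inequality (9) can also be tested exactly rather than through the analytic bound, since every failure probability is a binomial tail. The sketch below is an illustration under that observation, with hypothetical function names: it evaluates the expected number of failed instances of Inequalities (1) and (2), which is below 1 exactly when Inequality (9) holds, and searches for the first $\ell$ at which this happens.

```python
from fractions import Fraction
from math import comb


def initial_failure_sum(n, ell, k1, k4):
    """Expected number of failed inequality instances for an all-unknown n x ell matrix."""
    k = max(k1, k4)
    # One unordered pair, Inequality (1) with parameter k = max{k1, k4}.
    per_pair = Fraction(sum(comb(ell, j) for j in range(k)), 2 ** ell)
    # Both orderings of the pair, cases i = ell-k4+1 .. ell-1 of Inequality (2).
    for i in range(ell - k4 + 1, ell):
        bound = k4 - (ell - i)
        per_pair += 2 * Fraction(sum(comb(i, j) for j in range(bound)), 2 ** i)
    return comb(n, 2) * per_pair


def smallest_length(n, k1, k4):
    """First ell >= max(k1, k4, 1) for which the failure sum drops below 1."""
    ell = max(k1, k4, 1)
    while initial_failure_sum(n, ell, k1, k4) >= 1:
        ell += 1
    return ell
```

For n = 4 and $k_1 = k_4 = 2$ the sum is $6(\ell + 5)/2^\ell$, which first drops below 1 at $\ell = 7$. This value is only the first length at which the sufficient condition of Lemma 3 holds for the all-unknown matrix; it is not claimed to be the shortest possible word length.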

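Lemma 7 below solves Inequalities (11) and (12) in Lemma 6 for a useful range of $\ell$.

Lemma 7. Given $n \ge 2$, $k_1$, $k_4$, and $k = \max\{k_1, k_4\} \ge 1$, if we set

$$c_1 = 2 + \delta \text{ for any real } \delta > 0$$

and

$$c_2 = \frac{c_1}{2}\left\{\log\left(\frac{c_1}{(c_1 - 2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2}\right\},$$

then $\ell^* = \lceil c_1 \log n + c_2 k \rceil > 2k$ satisfies Inequalities (11) and (12) in Lemma 6, and thus there exists
a $(k_1, k_4)$-distance matrix of dimension $n \times \ell^*$. (As examples, when $\delta = 1$, $\ell^* = \lceil 3 \log n + 4.76k \rceil$;
and when $\delta = 0.1$, $\ell^* = \lceil 2.1 \log n + 6.28k \rceil$.)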

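Proof. Since $c_2 > 2$ by calculus, we have $\ell^* > 2k$, satisfying Inequality (11). Below we prove that
$\ell^*$ satisfies Inequality (12). Consider the function $f(x) = (c_2 - 2.5) + (c_1 - 2)x - \log(c_2 + c_1 x)$ and
let $z^* = \frac{\log n}{k}$. Next observe that

$$0 \le f(z^*) = (c_2 - 2.5) + (c_1 - 2) \cdot \frac{\log n}{k} - \log\left(c_2 + c_1 \cdot \frac{\log n}{k}\right)$$
$$\Rightarrow\ 0 \le (c_2 - 2.5)k + (c_1 - 2)\log n - k \log\left(c_2 + c_1 \cdot \frac{\log n}{k}\right)$$
$$\Rightarrow\ 0 \le (c_2 k + c_1 \log n) - k \log e - k \log\left(\frac{c_2 k + c_1 \log n}{k}\right) - 2 \log n - 2 \log k$$
$$\Rightarrow\ 0 \le \ell^* - k \log e - k \log\left(\frac{\ell^*}{k}\right) - 2 \log n - 2 \log k.$$

Thus if $f(z^*) \ge 0$, then $\ell^*$ satisfies Inequality (12). To prove $f(z^*) \ge 0$, we next solve the following
equation:

$$0 = f'(x)$$
$$\Leftrightarrow\ 0 = (c_1 - 2) - \frac{c_1}{(c_2 + c_1 x)\ln 2}$$
$$\Leftrightarrow\ c_2 + c_1 x = \frac{c_1}{(c_1 - 2)\ln 2}$$
$$\Leftrightarrow\ x = \frac{1}{(c_1 - 2)\ln 2} - \frac{c_2}{c_1}.$$

Continuing the proof for $f(z^*) \ge 0$, observe that since
$f''(x) = \frac{c_1^2}{(c_2 + c_1 x)^2 \ln 2} > 0$, the minimum
functional value of $f(x)$ occurs at $x_{\min} = \frac{1}{(c_1 - 2)\ln 2} - \frac{c_2}{c_1}$. Thus to show $f(z^*) \ge 0$, we only need to
show $f(x_{\min}) \ge 0$ by observing that the following four inequalities are all equivalent to $f(x_{\min}) \ge 0$.

[four equivalent inequalities, the last being Inequality (22); displays not recoverable]

The lemma follows from the fact that Inequality (22) follows from the definition of $c_2$. □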
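The link between $f(x_{\min}) \ge 0$ and the definition of $c_2$ can be checked by direct substitution. The chain below is one way to write it out, using $c_2 + c_1 x_{\min} = \frac{c_1}{(c_1 - 2)\ln 2}$; it is a reconstruction from the definitions of f and $x_{\min}$ and is not necessarily the same sequence of inequalities displayed by the authors.

```latex
\begin{align*}
f(x_{\min}) \ge 0
&\iff (c_2 - 2.5) + (c_1 - 2)\left(\frac{1}{(c_1 - 2)\ln 2} - \frac{c_2}{c_1}\right)
      - \log\left(\frac{c_1}{(c_1 - 2)\ln 2}\right) \ge 0 \\
&\iff c_2 - \frac{(c_1 - 2)\,c_2}{c_1}
      \ge \log\left(\frac{c_1}{(c_1 - 2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2} \\
&\iff \frac{2}{c_1}\, c_2
      \ge \log\left(\frac{c_1}{(c_1 - 2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2} \\
&\iff c_2 \ge \frac{c_1}{2}\left\{\log\left(\frac{c_1}{(c_1 - 2)\ln 2}\right) + 2.5 - \frac{1}{\ln 2}\right\}.
\end{align*}
```

With $c_2$ set equal to the right-hand side of the last line, $f(x_{\min}) = 0$, so the choice of $c_2$ is the smallest one for which this argument goes through for a given $c_1$.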

Lemma 8 below sets up the base case and the induction step of the iterative derandomization
process of Algorithm DetWords in Algorithm 1.

Lemma 8. Given $n \ge 2$, $k_1$, $k_4$, and $k = \max\{k_1, k_4\} \ge 1$, if we set $\ell = \ell^*$, then the following
statements hold for Algorithm DetWords.

1. (Base Case) At the end of Line 4 of Algorithm 1, the matrix M satisfies Inequality (9) in
Lemma 3, namely,

$$\mathrm{ExpCount}(M, k_1, k_4) > \binom{n}{2} \cdot (1 + 2(k_4 - 1)) - 1.$$

2. (Induction Step) For each of the $n\ell^*$ iterations of the nested for-loops in Algorithm 1, at the
end of Line 13, the matrix M also satisfies the above inequality.

Proof.

Statement 1 follows from Lemmas 7, 6, and 3.

Statement 2 follows from Statement 1 and Lemma 4. □

Theorem 9 below summarizes the performance of Algorithm DetWords. 

Theorem 9. Given $n \ge 2$, $k_1$, $k_4$, and $k = \max\{k_1, k_4\} \ge 1$, if we set $\ell = \ell^*$, then the following
statements hold for Algorithm DetWords.

1. Algorithm 1 outputs a code $\mathcal{W}_{1,4}(n, k_1, k_4)$ of $n$ binary words (i.e., DNA words) of length
$\ell^*$ that satisfies $C_1(k_1)$ and $C_4(k_4)$.

2. The word length $\ell^*$ is within a constant multiplicative factor of the smallest possible word
length for a code of $n$ binary words of equal length that satisfies $C_1(k_1)$ and $C_4(k_4)$.

3. Algorithm 1 runs in $O(n^2(\ell^*)^3)$ time.

Proof.

Statement 1. This statement follows from Lemmas 8 and 3 and the fact that the matrix output 
by Algorithm 1 is a binary matrix (i.e., has no unknowns). 

Statement 2. This statement follows from the definition of $\ell^*$ and Lemma 1(4).

Statement 3. We first analyze the running times of steps in Algorithm 1 as follows.

1. Line 4 takes $O(n\ell^*)$ time to generate the initial $M$.

2. Then for each of the $n\ell^*$ iterations of the nested for-loops to compute $E_0$, Line 7 does not
explicitly compute $M_0$. Instead, Algorithm 1 will first compute $\mathrm{ExpCount}(M, k_1, k_4)$ for the
initial $M$ where every entry is an unknown. This initialization task takes $O(n^2(\ell^*)^2)$ time
by Lemma 5 and Equalities (3), (4), and (5). Then, Line 7 will update $E_0$ incrementally by
recomputing

$\Pr\left(\mathcal{E}_1(M, \alpha, \beta, \max\{k_1, k_4\})\right)$, $\Pr\left(\mathcal{E}_4(M, \alpha, \beta, k_4, i)\right)$, and $\Pr\left(\mathcal{E}_4(M, \beta, \alpha, k_4, i)\right)$ (23)

for $\alpha = q$, all $\beta \ne q$ with $n \ge \beta \ge 1$, and all $i$ with $\ell^* - 1 \ge i \ge \ell^* - k_4 + 1$. By Lemma 5,
these recomputations and thus the incremental updating of $E_0$ take $O(n(\ell^*)^2)$ time in total
per loop iteration. In sum, the total running time of updating $E_0$ over the $n\ell^*$ loop iterations
is $O(n^2(\ell^*)^3)$.

3. Once $E_0$ is updated, Algorithm 1 will update $E_1$ at Line 8 in $O(1)$ time per loop iteration
using the linearity equality in Lemma 4.

4. Once $E_0$ and $E_1$ are updated, Algorithm 1 compares them at Line 9 and then updates $M$
accordingly at Line 10 or 12 in $O(1)$ time per loop iteration.

5. In sum, the total running time of the $n\ell^*$ iterations of the nested for-loops is dominated by
the total running time of updating $E_0$ over the $n\ell^*$ loop iterations and thus is $O(n^2(\ell^*)^3)$
time.

6. Outputting the final matrix $M$ at Line 16 takes $O(n\ell^*)$ time.

In summary, the time complexity of Algorithm 1 is dominated by the total running time of the
nested for-loops and thus is $O(n^2(\ell^*)^3) = O(n^2(k + \log n)^3)$. □

Technical Remarks. In the proof of Statement 3 of Theorem 9, the incremental updating of
$E_0$ at Line 7 can be made somewhat more efficient by modifying the proof of Lemma 5 with
more elaborate but still straightforward algorithmic details. Specifically, the left probability in
Expression (23) can be updated in $O(k)$ time instead of $O(\ell)$ time. Also, each of the middle and
right probabilities in Expression (23) can be updated in $O(k_4 - (\ell^* - i)) = O(k)$ time instead
of $O(i)$ time. Thus the total time for incrementally updating $E_0$ at Line 7 is $O(n\ell^* k)$ per loop
iteration, which is somewhat less than $O(n(\ell^*)^2)$. For the sake of brevity, we omit the details of
these improvements in this paper.

3.3 Improving Word Length $\ell^*$ Computationally and with a Larger Alphabet

The word length $\ell^*$ is obtained analytically. To keep the analysis of $\ell^*$ manageable, we
sacrifice the quality of $\ell^*$. In this section, we discuss two improvements of $\ell^*$ by computation.

Improving Word Length $\ell^*$ Computationally. Lemma 11 computationally improves the word
length $\ell^*$ by means of binary search.

Lemma 10. Let $M$ be a partially assigned matrix of dimension $n \times \ell$ where every entry is an unknown.
Given $n$, $k_1$, $k_4$, $k = \max\{k_1, k_4\}$, and $\ell$ as the input, $\mathrm{ExpCount}(M, k_1, k_4)$ can be computed
in $O(k + (k_4)^2 + \log \ell)$ time.

Proof. By Equalities (3), (4), and (5) and a similar analysis to the proof of Lemma 6, we have




ExpCount(M,A:i,A:4) = ( ^ ) • (1 + 2(A;4 - 1)) 

It is elementary to evaluate the right-hand side of this equality in $O(k + (k_4)^2 + \log \ell)$ time. □

Lemma 11. Given $n$, $k_1$, $k_4$, and $k = \max\{k_1, k_4\}$ as the input, it takes $O((k + k_4^2 + \log(\log n + k)) \log(\log n + k))$
time to compute the smallest $\ell$ that satisfies Inequality (9) in Lemma 3, namely,

$\mathrm{ExpCount}(M, k_1, k_4) > \binom{n}{2} \cdot (1 + 2(k_4 - 1)) - 1.$

Proof. By Lemmas 7, 6, and 3, we use $\ell^*$ as the initial upper bound for the desired smallest $\ell$. We
then use binary search and Lemma 10 to find this smallest desired $\ell$. This search process takes
$O(\log \ell^*)$ applications of Lemma 10 and thus runs in $O((k + (k_4)^2 + \log \ell^*)\log \ell^*)$ time, which is
$O((k + k_4^2 + \log(\log n + k)) \log(\log n + k))$ time. □
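
The binary search in the proof of Lemma 11 can be sketched as follows. Because the closed form of ExpCount is not reproduced here, this sketch swaps in a different test: the analytic sufficient condition that appears at the end of the chain in the proof of Lemma 7, taken here as the assumed form of Inequality (12). The lower end of the search range, $2k + 1$, is likewise an assumption drawn from the requirement $\ell > 2k$; on that range the tested expression is increasing in $\ell$, which is what makes binary search valid. All function names are illustrative.

```python
import math

def sufficient_condition(ell, n, k):
    """Assumed form of Inequality (12), with base-2 logarithms."""
    value = (ell - k * math.log2(math.e) - k * math.log2(ell / k)
             - 2 * math.log2(n) - 2 * math.log2(k))
    return value >= 0

def smallest_length(n, k, upper):
    """Smallest ell in [2k + 1, upper] passing the test, found by binary search.

    `upper` plays the role of the analytic bound from Lemma 7 and must itself pass.
    """
    if not sufficient_condition(upper, n, k):
        raise ValueError("upper bound does not satisfy the condition")
    lo, hi = 2 * k + 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if sufficient_condition(mid, n, k):
            hi = mid          # mid works, so the answer is at most mid
        else:
            lo = mid + 1      # mid fails, so the answer is larger
    return lo

if __name__ == "__main__":
    # upper = 50 is the Lemma 7 length for n = 1024, k = 4, delta = 1.
    print(smallest_length(1024, 4, 50))
```

The search makes $O(\log \ell^*)$ evaluations of the test, mirroring the count of applications of Lemma 10 in the proof.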



Further Improving Word Length $\ell^*$ with a Larger Alphabet. The smallest $\ell$ obtained by
Lemma 11 can be further improved by replacing the binary alphabet with the DNA alphabet in
the definition of a partially assigned matrix and modifying Algorithm 1 accordingly. This alphabet
change will shorten the smallest $\ell$ obtained by Lemma 11 because, intuitively, a
random DNA matrix of dimension $n \times \ell$ has a larger probability to be a $(k_1, k_4)$-distance matrix
than a random binary matrix of the same dimension. The analysis of the performance of such a
modified Algorithm 1 remains essentially the same, and the smallest desired $\ell$ to input into the
modified Algorithm 1 can be computed in the same manner and time complexity as by Lemma 11.
For the sake of brevity, we omit the details of this modification.



4 Designing Words for More Constraints

In this section, we give deterministic polynomial-time algorithms to construct short DNA words
for the following subsets of the constraints $C_1, \ldots, C_9$ based on Algorithm 1:

• $C_1$ through $C_6$ (see Theorem 13 in Section 4.1);

• $C_1$ through $C_7$ (see Theorem 15 in Section 4.2);

• $C_1$, $C_2$, $C_3$, $C_7$, and $C_8$ (see Theorem 17 in Section 4.3);

• $C_1$ through $C_8$ (see Theorems 19 and 21 in Section 4.4); and

• $C_1$ through $C_6$, and $C_9$ (see Theorem 23 in Section 4.5).

For the word constructions in this section, we will use Lemma 1(2) to simplify the constructions,
and it follows from Lemma 1(4) that the simplifications do not sacrifice the word length by more
than a constant multiplicative factor.

To implement the simplifications, we first clarify the notation $\ell^*$ by attaching the parameters
$k_1$ and $k_4$ to it as follows.

Given $n \ge 2$, $k_1 \ge 1$ and $k_4 \ge 1$, let

$\delta$ = any positive real,
$c_1 = 2 + \delta$,
$c_2 = \frac{c_1}{2}\left(\log\frac{c_1}{(c_1 - 2)\ln 2} + 2.5 - \frac{1}{\ln 2}\right)$, and
$\ell^*(k_1, k_4) = \lceil c_1 \log n + c_2 \max\{k_1, k_4\} \rceil$.

4.1 Designing Words for Constraints $C_1$ through $C_6$

Lemma 12 below shows how to transform a binary code that satisfies $C_1(k_1)$ and $C_4(k_4)$ to a DNA
code that satisfies $C_1(k_1)$ through $C_6(k_6)$.

Lemma 12.

1. Let $\mathcal{B}$ be a code of $n$ distinct binary words of equal length $\ell_{1,4}$ that satisfies $C_1(k_1)$ and $C_4(k_4)$.
Given $\mathcal{B}$, $k_2$, $k_3$, $k_5$, and $k_6$ as the input, we can deterministically construct a code $\mathcal{W}_{1\sim 6}$ of
$n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
and $C_6(k_6)$.

2. The length of the words in $\mathcal{W}_{1\sim 6}$ is $\ell_{1,4} + \max\{k_2, k_3, k_5, k_6\}$.

3. The construction takes $O(n(\ell_{1,4} + \max\{k_2, k_3, k_5, k_6\}))$ time.

Proof. Let $k = \max\{k_2, k_3, k_5, k_6\}$. We construct $\mathcal{W}_{1\sim 6}$ with the following steps:

1. Convert the binary code $\mathcal{B}$ into a DNA code by changing 0 to the character A and changing
1 to the character T in each word. Let $\mathcal{V}$ denote the set of the new words.

2. Append $k$ copies of the character C at the left end of each word in $\mathcal{V}$. Let $\mathcal{W}_{1\sim 6}$ be the set
of the new words.

It is clear that this construction takes $O(n(\ell_{1,4} + k))$ time, proving Statement 3. It is also clear
that the words in $\mathcal{W}_{1\sim 6}$ have equal length $\ell_{1,4} + k$, proving Statement 2. To prove Statement 1,
we observe that the two construction steps are deterministic and $\mathcal{W}_{1\sim 6}$ consists of $n$ distinct DNA
words of equal length. Below we verify that $\mathcal{W}_{1\sim 6}$ satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
and $C_6(k_6)$.

• That $\mathcal{W}_{1\sim 6}$ satisfies $C_1(k_1)$ and $C_4(k_4)$ follows directly from the assumption that $\mathcal{B}$ satisfies
these two constraints.

• To check $C_2(k_2)$ and $C_3(k_3)$, consider two words $Y$ and $X$ in $\mathcal{W}_{1\sim 6}$ ($Y \ne X$ for $C_2(k_2)$, but
$Y = X$ for $C_3(k_3)$). The leftmost $k$ characters in $Y$ are all C. For these two constraints,
these C's are compared with A, T, or G in $X^{RC}$. Therefore, the Hamming distance between
$Y$ and $X^{RC}$ is at least $k$. Since $k \ge k_2$ and $k \ge k_3$, constraints $C_2(k_2)$ and $C_3(k_3)$ hold for
$\mathcal{W}_{1\sim 6}$.

• To check $C_5(k_5)$ and $C_6(k_6)$, since $k \ge k_5$ and $k \ge k_6$, by Lemma 1(2) we only need to
check $C_5(k)$ and $C_6(k)$. Consider two words $Y$ and $X$ in $\mathcal{W}_{1\sim 6}$ ($Y \ne X$ for $C_5(k)$, but
$Y = X$ for $C_6(k)$). Let $\ell$ denote $\ell_{1,4} + k$. Also consider $i$ where $\ell > i > \ell - k$. Let
$j = k - (\ell - i)$. From the definitions of the constraints $C_1(k_1)$ through $C_6(k_6)$, we have $\ell > k$.
Thus, $i > j$ and $Y[1 \cdots i]$ has at least $j$ characters. The leftmost $j$ characters of $Y[1 \cdots i]$
are all C. For these two constraints, these C's are compared with characters A, T, or G in
$(X[1 \cdots i])^{RC}$. Therefore the Hamming distance between $Y[1 \cdots i]$ and $(X[1 \cdots i])^{RC}$ is at
least $j = k - (\ell - i)$, as required by $C_5(k)$ and $C_6(k)$. By a symmetrical argument for the
right ends of $(X[(\ell - i + 1) \cdots \ell])^{RC}$ and $Y[(\ell - i + 1) \cdots \ell]$, the Hamming distance between
$Y[(\ell - i + 1) \cdots \ell]$ and $(X[(\ell - i + 1) \cdots \ell])^{RC}$ is at least $k - (\ell - i)$, as required by $C_5(k)$ and
$C_6(k)$.

□ 
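
The two construction steps in the proof of Lemma 12, together with the checks used in the verification, can be sketched in code. This is a minimal illustration on a toy binary code; the helper names are hypothetical, the superscript RC is read as the reverse complement, and the toy input is not claimed to satisfy $C_1$ or $C_4$ for any particular parameters.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(word):
    """Reverse the word and complement each base."""
    return "".join(COMPLEMENT[c] for c in reversed(word))

def hamming(x, y):
    """Hamming distance between two words of equal length."""
    return sum(a != b for a, b in zip(x, y))

def binary_to_dna_1_to_6(binary_code, k):
    """Step 1: map 0 -> A and 1 -> T. Step 2: prepend k copies of C."""
    table = str.maketrans("01", "AT")
    return ["C" * k + w.translate(table) for w in binary_code]

def min_rc_distance(code):
    """Smallest distance between a word and the reverse complement of any word."""
    return min(hamming(y, reverse_complement(x)) for y in code for x in code)

def shifted_rc_ok(code, k):
    """Check prefix/suffix distances of at least k - (ell - i) for ell - k < i < ell."""
    ell = len(code[0])
    for i in range(ell - k + 1, ell):
        need = k - (ell - i)
        for y in code:
            for x in code:
                if hamming(y[:i], reverse_complement(x[:i])) < need:
                    return False
                if hamming(y[ell - i:], reverse_complement(x[ell - i:])) < need:
                    return False
    return True

if __name__ == "__main__":
    code = binary_to_dna_1_to_6(["0011", "0101", "1110"], k=3)
    print(code, min_rc_distance(code), shifted_rc_ok(code, 3))
```

The leading run of C's is what drives both checks: a reverse complement of a word built this way contains only A, T, and G.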

Theorem 13 below uses Theorem 9 and Lemma 12 to show how to construct a DNA code that
satisfies $C_1(k_1)$ through $C_6(k_6)$.

Theorem 13.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $k_4$, $k_5$, and $k_6$ as the input, we can deterministically construct
a code $\mathcal{W}_{1\sim 6}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$,
$C_4(k_4)$, $C_5(k_5)$, and $C_6(k_6)$.

2. The length of the words in $\mathcal{W}_{1\sim 6}$ is $\ell^*(k_1, k_4) + \max\{k_2, k_3, k_5, k_6\}$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4) + O(n(\log n + \max\{k_1, k_2, k_3, k_4, k_5, k_6\}))$ time,
where $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_4), k_1, k_4)$.

Proof. We construct $\mathcal{W}_{1\sim 6}$ with the following steps:

1. Let $\ell_{1,4} = \ell^*(k_1, k_4)$.

2. Construct a binary code $\mathcal{B}$ = DetWords$(n, \ell_{1,4}, k_1, k_4)$ by means of Theorem 9.

3. Construct $\mathcal{W}_{1\sim 6}$ by means of Lemma 12 using $\mathcal{B}$, $k_2$, $k_3$, $k_5$, and $k_6$ as the input.

With the above construction, this theorem follows directly from Theorem 9 and Lemma 12.

□ 

4.2 Designing Words for Constraints $C_1$ through $C_7$

Lemma 14 below shows how to transform a binary code that satisfies $C_1(k_1)$ and $C_4(k_4)$ to a DNA
code that satisfies $C_1(k_1)$ through $C_7(\gamma)$.

Lemma 14.

1. Let $\mathcal{B}$ be a code of $n$ distinct binary words of equal length $\ell_{1,4}$ that satisfies $C_1(k_1)$ and $C_4(k_4)$.
Given $\mathcal{B}$, $k_2$, $k_3$, $k_5$, $k_6$, and $\gamma$ as the input, we can deterministically construct a code $\mathcal{W}_{1\sim 7}$
of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
$C_6(k_6)$, and $C_7(\gamma)$.

2. The length of the words in $\mathcal{W}_{1\sim 7}$ is $\ell_{1,4} + 2\max\{k_2, k_3, k_5, k_6\}$.

3. The construction takes $O(n(\ell_{1,4} + \max\{k_2, k_3, k_5, k_6\}))$ time.

Proof. Let $k = \max\{k_2, k_3, k_5, k_6\}$. Let $\ell = \ell_{1,4} + 2k$. We construct $\mathcal{W}_{1\sim 7}$ with the following steps:

1. Append $k$ copies of 1 to each of the left and right ends of each word in $\mathcal{B}$. Let $\mathcal{B}'$ denote the
set of the new binary words of equal length $\ell$.

2. Choose $\lceil \gamma\ell \rceil$ arbitrary (e.g., evenly distributed) positions among $1, \ldots, \ell$ (see [16]).

3. For each word in $\mathcal{B}'$, at each of the above chosen $\lceil \gamma\ell \rceil$ positions, change 0 to C and change 1
to G, while at all the other positions, change 0 to A and change 1 to T. Let $\mathcal{W}_{1\sim 7}$ be the set
of the resulting DNA words (see [16]).

With the above construction, Statements 2 and 3 clearly hold. To prove Statement 1, observe
that the above construction steps are deterministic and $\mathcal{W}_{1\sim 7}$ consists of $n$ distinct DNA words of
equal length. Next, by a proof similar to but simpler than that of Lemma 12, $\mathcal{B}'$ satisfies $C_1(k_1)$
to $C_6(k_6)$ as constraints on binary words. Then, since the substitutions at Step 3 do not change
Hamming distances for $C_1(k_1)$ and do not decrease Hamming distances for $C_2(k_2)$ through $C_6(k_6)$,
these six constraints also hold for $\mathcal{W}_{1\sim 7}$. Moreover, it follows from the substitutions at Step 3 that
$C_7(\gamma)$ holds for $\mathcal{W}_{1\sim 7}$. □
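
The three steps of the construction in Lemma 14 can be sketched as follows. This is an illustration only: the rounding of $\gamma\ell$ is taken to be a ceiling, the "evenly distributed" positions are chosen by one simple rule that is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def binary_to_dna_1_to_7(binary_code, k, gamma):
    """Pad with k ones on both sides, then place C/G at ceil(gamma * ell) positions."""
    ell = len(binary_code[0]) + 2 * k
    m = math.ceil(gamma * ell)
    # Evenly spread 0-based positions; they are distinct because ell / m >= 1.
    gc_positions = {(i * ell) // m for i in range(m)}
    words = []
    for w in binary_code:
        padded = "1" * k + w + "1" * k
        dna = []
        for pos, bit in enumerate(padded):
            if pos in gc_positions:
                dna.append("C" if bit == "0" else "G")
            else:
                dna.append("A" if bit == "0" else "T")
        words.append("".join(dna))
    return words

if __name__ == "__main__":
    for word in binary_to_dna_1_to_7(["0011", "0101", "1110"], k=3, gamma=0.5):
        print(word, sum(c in "GC" for c in word))
```

Mapping C and A back to 0, and G and T back to 1, recovers the padded binary word, which is why the substitutions cannot reduce Hamming distances between words.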

Theorem 15 below uses Theorem 9 and Lemma 14 to show how to construct a DNA code that
satisfies $C_1(k_1)$ through $C_7(\gamma)$.

Theorem 15.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, and $\gamma$ as the input, we can deterministically construct
a code $\mathcal{W}_{1\sim 7}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$,
$C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_7(\gamma)$.

2. The length of the words in $\mathcal{W}_{1\sim 7}$ is $\ell^*(k_1, k_4) + 2\max\{k_2, k_3, k_5, k_6\}$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4) + O(n(\log n + \max\{k_1, k_2, k_3, k_4, k_5, k_6\}))$ time,
where $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_4), k_1, k_4)$.

Proof. We construct $\mathcal{W}_{1\sim 7}$ with the following steps:

1. Let $\ell_{1,4} = \ell^*(k_1, k_4)$.

2. Construct a binary code $\mathcal{B}$ = DetWords$(n, \ell_{1,4}, k_1, k_4)$ by means of Theorem 9.

3. Construct $\mathcal{W}_{1\sim 7}$ by means of Lemma 14 using $\mathcal{B}$, $k_2$, $k_3$, $k_5$, $k_6$, and $\gamma$ as the input.

With the above construction, this theorem follows directly from Theorem 9 and Lemma 14. □

4.3 Designing Words for Constraints $C_1$, $C_2$, $C_3$, $C_7$, and $C_8$

To eliminate long runs in words to satisfy $C_8(d)$, we first detail an algorithm in Algorithm 2, which
slightly modifies a similar algorithm of Kao et al. [16] to increase symmetry. Given a binary word
$X$ and $d$ as the input, this algorithm inserts a character into $X$ at the end of each interval of
length $d - 1$ from both the left end of $X$ and the right end of $X$ toward the middle of $X$. The
algorithm also inserts two characters at the middle of $X$. The inserted characters are complementary
to the ending character of each interval or complementary to the middle two characters of $X$.
The complementarity of the inserted characters and the spacings of the insertions ensure that the
resulting word $X'$ does not have consecutive 0's or consecutive 1's of length more than $d$. The
symmetrical manner in which the inserted characters are added to $X$ facilitates the checking of
constraints $C_2(k_2)$ and $C_3(k_3)$.

Lemma 16 below shows how to transform a binary code that satisfies $C_1(k_1)$ to a DNA code
that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$. The proof of this lemma uses Algorithm 2
to satisfy $C_8(d)$.

Algorithm 2 BreakRuns($X$, $d$)
1: Input: a binary word $X = x_1 x_2 \ldots x_\ell$ of length $\ell$ and an integer $d \ge 2$, where $\ell$ is assumed to
be even.

2: Output: a binary word $X'$ of length $\ell'$ that has at most $d$ consecutive 0's or at most $d$
consecutive 1's, where $\ell' = \ell + 2\lfloor \frac{\ell}{2(d-1)} \rfloor + 2$.

3: Let $u = d - 1$, $s = \lfloor \frac{\ell}{2u} \rfloor$, $t = su$, and $\mathrm{mid} = \frac{\ell}{2}$.
4: for $1 \le i \le s$ do
5: Let $\alpha_i = (x_{iu})^c$ and $\beta_i = (x_{\ell - iu + 1})^c$.
6: end for
7: Let $\lambda = (x_{\mathrm{mid}})^c (x_{\mathrm{mid}+1})^c$.
8: Split $X$ into three segments $L = X[1 \cdots t]$, $U = X[(t + 1) \cdots (\ell - t)]$, and $R = X[(\ell - t + 1) \cdots \ell]$.
9: Let $L' = x_1 \ldots x_u \alpha_1 x_{u+1} \ldots x_{2u} \alpha_2 x_{2u+1} \ldots x_t \alpha_s$.
10: Let $R' = \beta_s x_{\ell-t+1} \ldots x_{\ell-2u} \beta_2 x_{\ell-2u+1} \ldots x_{\ell-u} \beta_1 x_{\ell-u+1} \ldots x_\ell$.
11: Let $U' = x_{t+1} \ldots x_{\mathrm{mid}} \lambda x_{\mathrm{mid}+1} \ldots x_{\ell-t}$.
12: Let $X'$ be the concatenation of $L'$, $U'$, and $R'$.
13: return $X'$.



Lemma 16.

1. Let $\mathcal{B}_0$ be a code of $n$ distinct binary words of equal length $\ell_0$ that satisfies $C_1(k_1)$. Given $\mathcal{B}_0$,
$k_2$, $k_3$, $\gamma$, and $d$ as the input, we can deterministically construct a code $\mathcal{W}_{1\sim 3,7,8}$ of $n$ distinct
DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 3,7,8}$ is $\frac{d}{d-1}(\ell_0 + 2\max\{k_2, k_3\}) + O(1)$.

3. The construction takes $O(n(\ell_0 + \max\{k_2, k_3\}))$ time.

Proof. Our construction of $\mathcal{W}_{1\sim 3,7,8}$ is similar to the construction of $\mathcal{W}_{1\sim 7}$ in Lemma 14 with
additional work of using Algorithm 2 to break long runs in binary words. Specifically, we construct
$\mathcal{W}_{1\sim 3,7,8}$ with the following steps:

1. If $\ell_0$ is odd, then append 0 at the right end of each word in $\mathcal{B}_0$; otherwise, do not change the
words in $\mathcal{B}_0$. Let $\mathcal{B}_1$ be the set of the resulting words. Let $\ell_1$ be the length of the resulting
words; i.e., if $\ell_0$ is odd, then $\ell_1 = \ell_0 + 1$, else $\ell_1 = \ell_0$.

2. Let $k = \max\{k_2, k_3\}$. Append $k$ copies of 1 at each of the left and right ends of each word in
$\mathcal{B}_1$. Let $\mathcal{B}_2$ be the set of the new binary words. Let $\ell_2$ be the length of the new words; i.e.,
$\ell_2 = \ell_1 + 2k$.

3. Apply Algorithm 2 to each word in $\mathcal{B}_2$. Let $\mathcal{B}_3$ be the set of the output binary words. Let $\ell_3$
be the length of the new words; i.e., $\ell_3 = \ell_2 + 2\lfloor \frac{\ell_2}{2(d-1)} \rfloor + 2 = \frac{d}{d-1}(\ell_0 + 2\max\{k_2, k_3\}) + O(1)$.

4. Choose $\lceil \gamma\ell_3 \rceil$ arbitrary (e.g., evenly distributed) positions among $1, \ldots, \ell_3$ (see [16]).

5. For each word in $\mathcal{B}_3$, at each of the above chosen $\lceil \gamma\ell_3 \rceil$ positions, change 0 to C and change
1 to G, while at all the other positions, change 0 to A and change 1 to T. Let $\mathcal{W}_{1\sim 3,7,8}$ be the
set of the resulting DNA words (see [16]).

With the above construction, Statements 2 and 3 clearly hold. To prove Statement 1, observe
that the above construction steps are deterministic and $\mathcal{W}_{1\sim 3,7,8}$ consists of $n$ distinct DNA words
of equal length. We verify $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$ as follows.

1. Since $\mathcal{B}_0$ satisfies $C_1(k_1)$, $\mathcal{B}_1$ also satisfies $C_1(k_1)$.

2. Next, by a proof similar to but simpler than that of Lemma 12, $\mathcal{B}_2$ satisfies $C_1(k_1)$ through
$C_3(k_3)$ as constraints on binary words.

3. By the spacings of the insertions made by Algorithm 2, the insertions made at Step 3 do
not decrease Hamming distances for $C_1(k_1)$ through $C_3(k_3)$, so $\mathcal{B}_3$ continues to satisfy these
three constraints.

4. Further from the spacings and the complementarity of the characters inserted by Algorithm 2,
$\mathcal{B}_3$ additionally satisfies $C_8(d)$ as a constraint on binary words.

5. Since the substitutions made at Step 5 do not decrease Hamming distances for $C_1(k_1)$ through
$C_3(k_3)$, $\mathcal{W}_{1\sim 3,7,8}$ continues to satisfy $C_1(k_1)$ through $C_3(k_3)$.

6. Also, it follows from the substitutions made at Step 5 that $C_7(\gamma)$ holds for $\mathcal{W}_{1\sim 3,7,8}$.

7. Finally, since these substitutions do not increase lengths of consecutive occurrences of a character,
$C_8(d)$ holds for $\mathcal{W}_{1\sim 3,7,8}$.

□ 

Theorem 17 below uses Theorem 9 and Lemma 16 to show how to construct a DNA code that
satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$, and $C_8(d)$.

Theorem 17.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $\gamma$, and $d$ as the input, we can deterministically construct a code
$\mathcal{W}_{1\sim 3,7,8}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_7(\gamma)$,
and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 3,7,8}$ is $\frac{d}{d-1}(\ell^*(k_1, k_1) + 2\max\{k_2, k_3\}) + O(1)$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1) + O(n(\log n + \max\{k_1, k_2, k_3\}))$ time, where
$T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_1), k_1, k_1)$.

Proof. We construct $\mathcal{W}_{1\sim 3,7,8}$ with the following steps:

1. Let $\ell_0 = \ell^*(k_1, k_1)$.

2. Construct a binary code $\mathcal{B}_0$ = DetWords$(n, \ell_0, k_1, k_1)$ by means of Theorem 9.

3. Construct $\mathcal{W}_{1\sim 3,7,8}$ by means of Lemma 16 using $\mathcal{B}_0$, $k_2$, $k_3$, $\gamma$, and $d$ as the input.

With the above construction, this theorem follows directly from Theorem 9 and Lemma 16. □

Technical Remarks. We can reduce the word length of $\mathcal{W}_{1\sim 3,7,8}$ in Theorem 17(2) by simplifying
Algorithm DetWords to satisfy only $C_1(k_1)$ rather than both $C_1(k_1)$ and $C_4(k_1)$. For the sake of
brevity, we omit the details of this simplification.

4.4 Designing Words for Constraints $C_1$ through $C_8$

This section gives two ways to construct a DNA code that satisfies $C_1(k_1)$ through $C_8(d)$ in Theorems 19 and 21.

Lemma 18 below gives a way to transform a binary code that satisfies $C_1(k_1)$ to a DNA code
that satisfies $C_1(k_1)$ through $C_8(d)$.

Lemma 18. Assume $\frac{1}{d+1} \le \gamma \le \frac{d}{d+1}$.

1. Let $\mathcal{B}$ be a code of $n$ distinct binary words of equal length $\ell_0$ that satisfies $C_1(k_1)$. Given $\mathcal{B}$,
$k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input, we can deterministically construct a code $\mathcal{W}_{1\sim 8}$ of
$n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
$C_6(k_6)$, $C_7(\gamma)$, and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 8}$ is $\ell_0 + 2\max\{k_2, k_3, k_4, k_5, k_6\}$.

3. The construction takes $O(n(\ell_0 + \max\{k_2, k_3, k_4, k_5, k_6\}))$ time.

Proof. Let $k = \max\{k_2, k_3, k_4, k_5, k_6\}$. Let $\ell = \ell_0 + 2k$. This proof assumes $\gamma \ge \frac{1}{2}$. This assumption
is without loss of generality, since if $\gamma < \frac{1}{2}$, we can modify by symmetry the construction steps
below to construct a DNA code whose AT content is a $1 - \gamma$ fraction of the characters in each word.
We construct $\mathcal{W}_{1\sim 8}$ with the following steps:

1. Append $k$ copies of 1 at each of the left and right ends of each word in $\mathcal{B}$. Let $\mathcal{B}'$ be the set
of the new binary words, which have equal length $\ell$.

2. Partition the integer interval $[1, \ell]$ into integer subintervals $Z_1, Z_2, \ldots, Z_s$ for some $s$ such that
(1) each subinterval consists of at most $d$ integers and at least one integer and (2) the total
number of integers in the odd-indexed subintervals is $\lceil \gamma\ell \rceil$.

3. For each word in $\mathcal{B}'$, change every 0 (respectively, 1) whose position is in the odd-indexed
subintervals to C (respectively, G), and also change every 0 (respectively, 1) whose position is
in the even-indexed subintervals to A (respectively, T). Let $\mathcal{W}_{1\sim 8}$ be the set of the resulting
DNA words.

We now prove the three statements of this lemma. First of all, Statement 2 clearly holds. As
for the other two statements, since $d \ge 2$ and $\frac{1}{d+1} \le \gamma \le \frac{d}{d+1}$, the partition of $[1, \ell]$ at Step 2 exists
and can be computed in $O(\ell)$ time in a straightforward manner. With this fact, Statement 3 clearly
holds. To prove Statement 1, observe that the above construction steps are deterministic and $\mathcal{W}_{1\sim 8}$
consists of $n$ distinct DNA words of equal length $\ell$. We verify $C_1(k_1)$ through $C_8(d)$ as follows.

By an analysis similar to but simpler than the proof of Lemma 12, $\mathcal{B}'$ satisfies $C_1(k_1)$ through
$C_6(k_6)$ as constraints on binary words. At Step 3, the substitutions do not change Hamming distances
for $C_1(k_1)$ and do not decrease Hamming distances for $C_2(k_2)$ through $C_6(k_6)$, so $\mathcal{W}_{1\sim 8}$
continues to satisfy $C_1(k_1)$ through $C_6(k_6)$. The aggregate size bound of the odd-indexed subintervals
at Step 2 ensures that $\mathcal{W}_{1\sim 8}$ additionally satisfies $C_7(\gamma)$. The individual size bounds of the
subintervals at Step 2 and the alternating CG-versus-AT substitutions between odd-indexed and
even-indexed subintervals at Step 3 ensure that $\mathcal{W}_{1\sim 8}$ satisfies $C_8(d)$ as well. □

Theorem 19 below uses Theorem 9 and Lemma 18 to give our first way to construct a DNA
code that satisfies $C_1(k_1)$ through $C_8(d)$.

Theorem 19. Assume $\frac{1}{d+1} \le \gamma \le \frac{d}{d+1}$.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input, we can deterministically
construct a code $\mathcal{W}_{1\sim 8}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$,
$C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, $C_7(\gamma)$, and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 8}$ is $\ell^*(k_1, k_1) + 2\max\{k_2, k_3, k_4, k_5, k_6\}$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1) + O(n(\log n + \max\{k_1, k_2, k_3, k_4, k_5, k_6\}))$ time,
where $T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_1), k_1, k_1)$.

Proof. We construct $\mathcal{W}_{1\sim 8}$ with the following steps:

1. Let $\ell_0 = \ell^*(k_1, k_1)$.

2. Construct a binary code $\mathcal{B}$ = DetWords$(n, \ell_0, k_1, k_1)$ by means of Theorem 9.

3. Construct $\mathcal{W}_{1\sim 8}$ by means of Lemma 18 using $\mathcal{B}$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input.

With the above construction, this theorem follows directly from Theorem 9 and Lemma 18. □

Lemma 20 below gives our second way to transform a binary code that satisfies $C_1(k_1)$ to a
DNA code that satisfies $C_1(k_1)$ through $C_8(d)$.

Lemma 20. Assume $d \ge 3$.

1. Let $\mathcal{B}_0$ be a code of $n$ distinct binary words of equal length $\ell_0$ that satisfies $C_1(k_1)$. Given $\mathcal{B}_0$,
$k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input, we can deterministically construct a code $\mathcal{W}_{1\sim 8}$ of
$n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$,
$C_6(k_6)$, $C_7(\gamma)$, and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 8}$ is $\frac{d}{d-1}\ell_0 + \frac{d}{d-2} 2\max\{k_2, k_3, k_4, k_5, k_6\} + O(d)$.

3. The construction takes $O(n(\ell_0 + \max\{k_2, k_3, k_4, k_5, k_6\}))$ time.

Proof. Let $k = \max\{k_2, k_3, k_4, k_5, k_6\}$. We construct $\mathcal{W}_{1\sim 8}$ with the following steps:

1. For $\mathcal{B}_0$, partition each word into $\lceil \frac{\ell_0}{d-1} \rceil$ sub-words of length $d - 1$ except that the rightmost
sub-word may be shorter. For each sub-word $Z$, insert a bit at the right end of $Z$ that is
complementary to the original rightmost bit of $Z$. Let $\mathcal{B}_1$ be the set of the new binary words.
Let $\ell_1$ be the equal length of the new words; i.e., $\ell_1 = \ell_0 + \lceil \frac{\ell_0}{d-1} \rceil = \frac{d}{d-1}\ell_0 + O(1)$.

2. For $\mathcal{B}_1$, append one copy of 1 at the left end of each word and one copy of 0 at the right end
of each word. Let $\mathcal{B}_2$ be the set of the new binary words. Let $\ell_2$ be the equal length of the
new words; i.e., $\ell_2 = \ell_1 + 2 = \frac{d}{d-1}\ell_0 + O(1)$.

3. For $\mathcal{B}_2$, append $\lceil \frac{k}{d-2} \rceil$ copies of the length-$d$ binary word $11 \cdots 110$ at each of the left and right
ends of each word. Let $\mathcal{B}_3$ be the set of the new binary words. Let $\ell_3$ be the equal length of
the new words; i.e., $\ell_3 = \ell_2 + 2\lceil \frac{k}{d-2} \rceil d = \frac{d}{d-1}\ell_0 + \frac{d}{d-2} 2k + O(d)$.

4. For $\mathcal{B}_3$, for the leftmost $\lceil \gamma\ell_3 \rceil$ characters in each word, change every 0 (respectively, 1) to C
(respectively, G), and for the remaining $\ell_3 - \lceil \gamma\ell_3 \rceil$ characters in each word, change every
0 (respectively, 1) to A (respectively, T). Let $\mathcal{W}_{1\sim 8}$ be the set of the resulting DNA words. The
new words have equal length $\ell_3$.

We now prove the three statements of this lemma. First of all, Statements 2 and 3 clearly
hold. To prove Statement 1, observe that the above construction steps are deterministic and $\mathcal{W}_{1\sim 8}$
consists of $n$ distinct DNA words of equal length $\ell_3$. We verify $C_1(k_1)$ through $C_8(d)$ as follows.

• Since $\mathcal{B}_0$ satisfies $C_1(k_1)$, the codes $\mathcal{B}_1$, $\mathcal{B}_2$, $\mathcal{B}_3$, and $\mathcal{W}_{1\sim 8}$ all satisfy $C_1(k_1)$.

• That $\mathcal{B}_3$ satisfies $C_2(k_2)$ through $C_6(k_6)$ follows from Step 3 and an analysis similar to the
proof of Lemma 12. Consequently, $\mathcal{W}_{1\sim 8}$ also satisfies $C_2(k_2)$ through $C_6(k_6)$.

• From Step 4, $\mathcal{W}_{1\sim 8}$ also satisfies $C_7(\gamma)$.

• From Steps 1 through 3, $\mathcal{B}_3$ satisfies $C_8(d)$. Consequently, $\mathcal{W}_{1\sim 8}$ satisfies $C_8(d)$ as well.

□ 
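
The run-breaking and padding steps in the proof of Lemma 20 can be sketched as follows. Because the number of padding blocks used at Step 3 is hard to read in this copy, the sketch takes it as an explicit parameter `num_blocks` rather than fixing a formula; the function names and the ceiling used for the GC count are likewise assumptions of this illustration.

```python
import math

def break_runs(word, d):
    """Step 1: after each sub-word of length d - 1 (the last may be shorter),
    insert the complement of that sub-word's last bit."""
    flip = {"0": "1", "1": "0"}
    pieces = []
    for start in range(0, len(word), d - 1):
        sub = word[start:start + d - 1]
        pieces.append(sub + flip[sub[-1]])
    return "".join(pieces)

def max_run(word):
    """Length of the longest run of one repeated character."""
    best = run = 0
    prev = None
    for c in word:
        run = run + 1 if c == prev else 1
        best = max(best, run)
        prev = c
    return best

def lemma20_sketch(binary_code, d, gamma, num_blocks):
    """Steps 1-4 with a caller-chosen number of 11...10 padding blocks per side."""
    pad = ("1" * (d - 1) + "0") * num_blocks
    words = [pad + "1" + break_runs(w, d) + "0" + pad for w in binary_code]
    gc = math.ceil(gamma * len(words[0]))
    to_gc, to_at = str.maketrans("01", "CG"), str.maketrans("01", "AT")
    return [w[:gc].translate(to_gc) + w[gc:].translate(to_at) for w in words]

if __name__ == "__main__":
    for word in lemma20_sketch(["000000", "111111", "010011"], 3, 0.4, 2):
        print(word, max_run(word))
```

Each inserted bit differs from the bit before it, so a run can contain at most one inserted bit followed by at most $d - 1$ original bits, which caps runs at length $d$.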

Theorem 21 below uses Theorem 9 and Lemma 20 to give our second way to construct a DNA
code that satisfies $C_1(k_1)$ through $C_8(d)$.

Theorem 21. Assume $d \ge 3$.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input, we can deterministically
construct a code $\mathcal{W}_{1\sim 8}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$,
$C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, $C_7(\gamma)$, and $C_8(d)$.

2. The length of the words in $\mathcal{W}_{1\sim 8}$ is $\frac{d}{d-1}\ell^*(k_1, k_1) + \frac{d}{d-2} 2\max\{k_2, k_3, k_4, k_5, k_6\} + O(d)$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1) + O(n(\log n + \max\{k_1, k_2, k_3, k_4, k_5, k_6\}))$ time,
where $T_{1,4}(n, \ell^*(k_1, k_1), k_1, k_1)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_1), k_1, k_1)$.

Proof. We construct $\mathcal{W}_{1\sim 8}$ with the following steps:

1. Let $\ell_0 = \ell^*(k_1, k_1)$.

2. Construct a binary code $\mathcal{B}_0$ = DetWords$(n, \ell_0, k_1, k_1)$ by means of Theorem 9.

3. Construct $\mathcal{W}_{1\sim 8}$ by means of Lemma 20 using $\mathcal{B}_0$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $\gamma$, and $d$ as the input.

With the above construction, this theorem follows directly from Theorem 9 and Lemma 20. □



Technical Remarks. As with Theorem 17(2), we can reduce the word lengths of $\mathcal{W}_{1\sim 8}$ in Theorems 19(2)
and 21(2) by simplifying Algorithm DetWords to satisfy only $C_1(k_1)$ rather than both
$C_1(k_1)$ and $C_4(k_1)$.

Furthermore, for the word length formulas in Lemma 20(2) and Theorem 21(2), the left and 
middle terms in each formula are decreasing functions of d while the right term is an increasing 
function of d. By Lemma 1(3), we can first computationally find an integer d' such that d' > d and 
d' minimizes the value of the respective length formula and then apply Lemma 20 or Theorem 21 

to this d' instead of d to compute Wir^s- Analytically, for example, when d > 1, a reasonable 

initial approximation for d' would be + !> where i = io for Lemma 20(2) and £ = i*{ki, ki) for 
Theorem 21(2). 

4.5 Designing Words for Constraints $C_1$ through $C_6$, and $C_9$

We now show how to construct DNA words that satisfy the free energy constraint $C_9(\sigma)$.

Following the approach of Breslauer et al. [8], the free energy of a DNA word $X = x_1 x_2 \cdots x_\ell$
is approximated by the formula

$\mathrm{FE}(X) = \text{correction factor} + \sum_{i=1}^{\ell-1} \Gamma_{x_i, x_{i+1}}$,

where $\Gamma_{x,y}$ is an integer denoting the pairwise free energy between base $x$ and base $y$.

Building on the work of Kao et al. [16], for simplicity and without loss of generality, we denote
the free energy of $X$ to be

$\mathrm{FE}(X) = \sum_{i=1}^{\ell-1} \Gamma_{x_i, x_{i+1}}$

with respect to a given pairwise energy function $\Gamma$. In other words, the correction factor is set to 0.

• Let $\Gamma_{\max}$ and $\Gamma_{\min}$ be the maximum and the minimum of the 16 entries of $\Gamma$, respectively.

• Let $D = \Gamma_{\max} - \Gamma_{\min}$.
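
The free energy sum and the quantities $\Gamma_{\max}$, $\Gamma_{\min}$, and $D$ can be computed directly from a table of pairwise energies. The sketch below uses a made-up integer table purely for illustration; its values are hypothetical and are not measured energies, and the function name is not from the paper.

```python
BASES = "ACGT"

def free_energy(word, gamma):
    """Sum of pairwise energies over adjacent bases, with correction factor 0."""
    return sum(gamma[(word[i], word[i + 1])] for i in range(len(word) - 1))

# Hypothetical 16-entry table: 5 plus one unit for each G or C in the pair.
TOY_GAMMA = {
    (x, y): 5 + (x in "GC") + (y in "GC") for x in BASES for y in BASES
}

GAMMA_MAX = max(TOY_GAMMA.values())
GAMMA_MIN = min(TOY_GAMMA.values())
D = GAMMA_MAX - GAMMA_MIN

if __name__ == "__main__":
    print(free_energy("ACGT", TOY_GAMMA), GAMMA_MAX, GAMMA_MIN, D)
```

For a word of length $\ell$, the sum has $\ell - 1$ terms, so any two words of the same length have free energies that differ by at most $(\ell - 1)D$.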

Theorem 22 below gives a way to transform a DNA code that satisfies $C_1(k_1)$ through $C_6(k_6)$
to a DNA code that satisfies $C_1(k_1)$ through $C_6(k_6)$ and $C_9(4D + \Gamma_{\max})$.
Theorem 22 (Kao, Sanghi, and Schweller [16]).

1. Let $\mathcal{B}_0$ be a code of $n$ distinct DNA words of equal length $\ell_0$ that satisfies $C_1(k_1)$, $C_2(k_2)$,
$C_3(k_3)$, $C_4(k_4)$, $C_5(k_5)$, and $C_6(k_6)$. There is a deterministic algorithm that takes $\mathcal{B}_0$ and
$\Gamma$ as the input and constructs a code $\mathcal{W}_{1\sim 6,9}$ of $n$ distinct DNA words of equal length that
satisfies $C_9(4D + \Gamma_{\max})$ in addition to satisfying $C_1(k_1)$ through $C_6(k_6)$.

2. The length of the words in $\mathcal{W}_{1\sim 6,9}$ is $2\ell_0$.

3. The construction takes $O(\min\{n\ell_0 \log \ell_0, \ell_0^{1.5} \log^{0.5} \ell_0 + n\ell_0\})$ time.

Theorem 23 below uses Theorems 22 and 13 to give a way to construct a DNA code that satisfies
$C_1(k_1)$ through $C_6(k_6)$ and $C_9(4D + \Gamma_{\max})$.

Theorem 23.

1. Given $n \ge 2$, $k_1 \ge 1$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, and $\Gamma$ as the input, we can deterministically construct
a code $\mathcal{W}_{1\sim 6,9}$ of $n$ distinct DNA words of equal length that satisfies $C_1(k_1)$, $C_2(k_2)$, $C_3(k_3)$,
$C_4(k_4)$, $C_5(k_5)$, $C_6(k_6)$, and $C_9(4D + \Gamma_{\max})$.

2. The length of the words in $\mathcal{W}_{1\sim 6,9}$ is $\ell_0 = 2(\ell^*(k_1, k_4) + \max\{k_2, k_3, k_5, k_6\})$.

3. The construction takes $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4) + O(\min\{n\ell_0 \log \ell_0, \ell_0^{1.5} \log^{0.5} \ell_0 + n\ell_0\})$ time,
where $T_{1,4}(n, \ell^*(k_1, k_4), k_1, k_4)$ is the running time of the call DetWords$(n, \ell^*(k_1, k_4), k_1, k_4)$.

Proof. We construct $\mathcal{W}_{1\sim 6,9}$ with the following steps:

1. Construct a DNA code $\mathcal{B}_0$ by means of Theorem 13 using $n$, $k_1$, $k_2$, $k_3$, $k_4$, $k_5$, and $k_6$ as the
input.

2. Construct $\mathcal{W}_{1\sim 6,9}$ by means of Theorem 22 using $\mathcal{B}_0$ and $\Gamma$ as the input.

With the above construction, this theorem follows directly from Theorems 22 and 13. □

5 Further Research

In this paper, we have introduced deterministic polynomial-time algorithms for constructing $n$
DNA words that satisfy various subsets of the constraints $C_1$ through $C_9$ and have length within a
constant multiplicative factor of the shortest possible word length. However, no known algorithm
can efficiently construct similarly short words that satisfy all nine constraints. It would be of
significance to find efficient algorithms to construct short words that satisfy all nine constraints.
Furthermore, it would be of interest to design efficient algorithms to construct short words for other
useful constraints. In particular, observe that the constraints $C_1$ through $C_9$ are based on pair-wise
relations of words. Conceivably, our derandomization techniques are applicable to other classes of
codes based on $m$-wise relations of words for constant $m$.